Keyword Search is All You Need: Achieving RAG-Level Performance Without Vector Databases
Researchers at Amazon Web Services report that an agent equipped with simple keyword search tools can come close to the performance of complex vector database systems in document question-answering tasks. Their study shows that agentic keyword search achieves over 90% of traditional RAG performance while being simpler and more cost-effective.
The Problem with Traditional RAG
Retrieval-Augmented Generation (RAG) systems combine large language models with external knowledge bases to reduce hallucinations and improve factual accuracy. However, RAG presents significant challenges:
- High maintenance overhead: Vector databases require frequent updates and substantial infrastructure
- Integration complexity: Setting up and maintaining embeddings, chunking strategies, and retrieval pipelines
- Cost burden: Especially problematic for organizations with rapidly changing knowledge bases
The Agentic Alternative
The research team developed an agent-based approach that uses basic Linux command-line tools instead of vector databases. Their system leverages:
- PDF metadata analysis: Understanding document structure before searching
- RipGrep-All (rga): Regex-based pattern matching across multiple file types
- PDFGrep: PDF-specific search with page-range targeting
- Iterative refinement: Agents modify search strategies based on results
The agent follows a simple workflow: analyze available documents, perform broad keyword searches, then use targeted searches with error handling and automatic retry mechanisms.
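The broad-then-targeted pattern could look roughly like the sketch below. It is an illustration only, not the researchers' code: the function names are hypothetical, and the exact flags used in the study are not given here. The flags shown (-i, -C, -n, --page-range) are standard options of ripgrep/rga and recent pdfgrep releases.

```python
import shutil
import subprocess
from typing import List, Optional


def rga_command(pattern: str, folder: str, context_lines: int = 2) -> List[str]:
    """Broad pass: case-insensitive regex search across all files in a folder."""
    return ["rga", "-i", "-C", str(context_lines), pattern, folder]


def pdfgrep_command(pattern: str, pdf_path: str,
                    page_range: Optional[str] = None,
                    context_lines: int = 2) -> List[str]:
    """Targeted pass: search one PDF, optionally limited to pages such as '10-25'."""
    cmd = ["pdfgrep", "-i", "-n", "-C", str(context_lines)]
    if page_range:
        cmd += ["--page-range", page_range]
    return cmd + [pattern, pdf_path]


def run_search(cmd: List[str], timeout_s: int = 60) -> str:
    """Run a search command and return its stdout ('' when nothing matched)."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} is not installed or not on PATH")
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    # Both tools follow the grep convention: exit code 1 means "no matches".
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.strip())
    return result.stdout
```

An agent would typically start with something like run_search(rga_command("consensus", "docs/")) and then narrow down with pdfgrep_command on the pages where the broad search found hits ("docs/" is a placeholder path).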
Experimental Results
Testing across six diverse datasets produced the following results:
Average Attainment Scores (vs. RAG baseline):
- Faithfulness: 94.52%
- Context Recall: 88.05%
- Answer Correctness: 91.48%
Standout Performance:
- BlockchainSolana dataset: 99.97% answer correctness
- LLM Survey paper: 99.51% answer correctness
- FinanceBench dataset: 6 percentage point improvement over traditional RAG
The keyword search approach performed particularly well on technical documentation and complex financial documents, where active search capabilities outperformed static chunk-based retrieval.
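To make the attainment scores above concrete, here is a small worked example. It assumes that attainment means the agent's metric expressed as a percentage of the RAG baseline's metric, which is how "vs. RAG baseline" reads here; the function name and the input scores are made up for illustration and are not taken from the study.

```python
def attainment(agent_score: float, rag_score: float) -> float:
    """Agent score as a percentage of the RAG baseline score (assumed definition)."""
    if rag_score <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * agent_score / rag_score


# Made-up scores, for illustration only.
print(round(attainment(0.45, 0.50), 2))  # agent reaches 90% of the baseline
print(attainment(0.60, 0.50) > 100)      # above 100 means the agent beat the baseline
```

Under this reading, a value above 100% corresponds to a dataset where keyword search outperforms the RAG baseline. Note that an attainment ratio is not the same thing as a difference in percentage points.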
Implementation Advantages
The agentic approach offers several practical benefits:
- Simplicity: No vector database setup or embedding model management required
- Cost-effectiveness: Eliminates infrastructure costs for maintaining large-scale vector stores
- Flexibility: Adapts to new document types without retraining or knowledge base updates
- Real-time capability: Searches current documents without preprocessing delays
Limitations and Considerations
The research identified several constraints:
- Large document performance: Degradation with very large files
- Context window limits: Bounded by LLM token constraints
- Multimedia handling: Limited to text-based content
- Contextual nuance: May miss subtle semantic relationships that embeddings capture
When to Choose Keyword Search
This approach works best for:
- Frequently updated knowledge bases: Where vector database maintenance becomes burdensome
- Resource-constrained environments: Where infrastructure costs matter
- Technical documentation: Where precise term matching is crucial
- Rapid prototyping: When you need quick results without complex setup
Implementation Guide
To implement this approach:
- Set up agent framework: Use LangChain with ReAct reasoning
- Configure search tools: Install rga, pdfgrep, and metadata extraction scripts
- Design search strategy: Start with metadata analysis, then iterative keyword searches
- Add error handling: Implement retry mechanisms for failed searches
- Optimize context extraction: Use surrounding text capture (-C flag) for better context
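The error-handling and context-extraction steps above can be sketched without committing to a particular agent framework. The code below is a minimal, framework-agnostic illustration, not the study's implementation: SearchOutcome, default_runner, and search_with_fallbacks are hypothetical names, and the list of candidate patterns is assumed to come from the agent. It tries each pattern in turn with rga, uses -C to capture surrounding lines, and records why earlier patterns were abandoned.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

Runner = Callable[[List[str]], str]


@dataclass
class SearchOutcome:
    pattern: Optional[str] = None  # first pattern that produced matches
    output: str = ""               # raw tool output for that pattern
    failures: List[str] = field(default_factory=list)  # why earlier patterns were dropped


def default_runner(cmd: List[str]) -> str:
    """Run the command; grep-style exit code 1 means 'no matches', not an error."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.strip() or f"{cmd[0]} failed")
    return result.stdout


def search_with_fallbacks(patterns: Sequence[str], folder: str,
                          runner: Runner = default_runner,
                          context_lines: int = 3) -> SearchOutcome:
    """Try patterns in order and stop at the first one that returns matches."""
    outcome = SearchOutcome()
    for pattern in patterns:
        # -C N asks rga to print N lines of context around each match.
        cmd = ["rga", "-i", "-C", str(context_lines), pattern, folder]
        try:
            output = runner(cmd)
        except (RuntimeError, OSError, subprocess.TimeoutExpired) as exc:
            # Covers an invalid regex, a missing binary, or a timeout.
            outcome.failures.append(f"{pattern!r}: {exc}")
            continue
        if output.strip():
            outcome.pattern, outcome.output = pattern, output
            return outcome
        outcome.failures.append(f"{pattern!r}: no matches")
    return outcome
```

The runner parameter is injectable so the loop can be exercised with a stub in place of the real tool. In an agent setup, the failures list could be fed back to the model so it can propose new keywords.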
The Bottom Line
This research challenges the assumption that vector databases are essential for high-quality document retrieval. For many applications, especially those requiring frequent updates or operating under resource constraints, agentic keyword search provides a compelling alternative that is simpler to both implement and maintain.
Reaching over 90% of RAG performance suggests that semantic search may be less critical than previously thought for many document QA tasks. Consider starting with keyword search for your next RAG project; you might find it is all you need.