RICHES: A Novel Approach to Retrieval-Augmented Generation Through Unified Sequence Generation

Traditional RAG systems split retrieval and generation into separate components, creating complexity and limiting flexibility. RICHES eliminates this separation by interleaving document retrieval directly within sequence generation, enabling more sophisticated question-answering through a single model.

The Problem with Traditional RAG

Current RAG systems require two distinct components: a retriever that finds relevant documents and a generator that produces answers. This architecture creates several limitations:

  • Fixed pipeline: You cannot adapt the retrieval strategy based on partial generation
  • Single-hop limitation: Most systems retrieve documents once, before generation begins
  • Complex integration: Maintaining separate retriever and generator models increases system complexity

How RICHES Works

RICHES unifies retrieval and generation by teaching language models to decode document contents directly from a corpus during text generation. The model learns to:

  1. Generate retrieval tokens that specify which documents to access
  2. Decode document contents constrained to the available corpus
  3. Continue generation using the retrieved information
  4. Plan next retrievals based on what it has generated so far

This approach enables multi-hop retrievals where the model can retrieve additional documents based on insights from previously retrieved content.
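The interleaved loop above can be sketched in a few lines of Python. Everything here is a toy: `model_step` stands in for one call to the language model, the `<retrieve>` marker is a hypothetical special token, and the word-overlap score is a stand-in for the corpus-constrained decoding RICHES actually performs.

```python
RETRIEVE = "<retrieve>"  # hypothetical marker that opens a retrieval turn

def overlap(query, passage):
    # Toy relevance score (shared lowercase words); a stand-in for
    # decoding the passage directly under corpus constraints.
    return len(set(query.lower().split()) & set(passage.lower().split()))

def generate_with_retrieval(model_step, corpus, question, max_hops=3):
    """Alternate free-form generation with retrieval from `corpus`.

    `model_step` takes the context so far and returns the next chunk
    of generated text.
    """
    context = question
    evidence = []
    for _ in range(max_hops):
        chunk = model_step(context)
        context += chunk
        if RETRIEVE not in chunk:
            break  # the model chose to finish its answer
        # Text after the marker acts as the retrieval request; splice
        # the best-matching passage back into the context.
        query = chunk.split(RETRIEVE)[-1].strip()
        passage = max(corpus, key=lambda p: overlap(query, p))
        evidence.append(passage)
        context += " " + passage
    return context, evidence
```

Because each retrieved passage is appended to the context before the next model step, a later retrieval request can build on what an earlier one returned, which is exactly the multi-hop behavior described above.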

Key Advantages

Single Model Architecture: RICHES works with any instruction-tuned language model without additional training or separate retriever components.

Adaptive Retrieval: The model decides when and what to retrieve based on the generation context, enabling more sophisticated reasoning patterns.

Attribution Support: Since the model explicitly retrieves documents during generation, it naturally provides evidence attribution for its answers.

Multi-hop Capability: The model can perform multiple retrieval steps, using information from earlier retrievals to guide later ones.

Implementation Approach

RICHES operates through constrained decoding where the language model generates special tokens that trigger document retrieval. When the model needs information, it:

  1. Generates a retrieval request token
  2. Decodes the relevant document content from the corpus
  3. Uses this information to continue generating the answer
  4. Repeats as needed for complex questions

The constraint mechanism ensures the model can only decode verbatim passages from the available corpus, so retrieved evidence is real text rather than a hallucinated quotation.
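One simple way to picture this constraint is a trie built over the corpus: at each step, the decoder may only emit a symbol that keeps its output a prefix of some real passage. The character-level sketch below is an illustration with hypothetical helper names; the actual mechanism operates on model tokens.

```python
def build_trie(passages):
    """Build a character-level trie over the corpus passages."""
    root = {}
    for passage in passages:
        node = root
        for ch in passage:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-passage marker
    return root

def allowed_next(trie, prefix):
    """Symbols the decoder is permitted to emit after `prefix`."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return set()  # prefix left the corpus: no legal continuation
        node = node[ch]
    return {k for k in node if k != "$"}
```

In a real decoder, `allowed_next` would be used to mask the model's logits so that only legal continuations can be sampled, which is what forces the decoded span to be an exact corpus string.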

Performance Results

Testing on open-domain question-answering tasks shows RICHES performs competitively with traditional RAG systems while offering greater flexibility. The unified approach particularly excels at:

  • Multi-hop questions requiring information from multiple sources
  • Attributed QA where evidence sources must be cited
  • Complex reasoning tasks benefiting from iterative retrieval

Getting Started

To implement RICHES, you need:

  1. An instruction-tuned language model
  2. A document corpus with indexing
  3. Constrained decoding implementation
  4. Prompts that teach the model when to retrieve

The approach requires no additional model training, making it accessible for teams already using instruction-tuned models.
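As a starting point for item 4, a prompt might demonstrate the retrieve-then-answer pattern with a worked example. The `<retrieve>` marker syntax below is an assumption for illustration, not a documented format.

```python
# A minimal few-shot prompt sketch: one demonstration shows the model
# when to open a retrieval span before committing to an answer.
FEW_SHOT = """\
Question: Who wrote the novel that inspired the film Blade Runner?
<retrieve> Do Androids Dream of Electric Sheep? is a novel by Philip K. Dick. </retrieve>
Answer: Philip K. Dick

Question: {question}
"""

def build_prompt(question):
    """Fill the user's question into the few-shot template."""
    return FEW_SHOT.format(question=question)
```

At inference time, the text the model generates inside the retrieval span is the part that gets constrained to the corpus; everything else is decoded freely.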

Next Steps

RICHES represents a significant shift toward unified retrieval-generation architectures. Consider experimenting with this approach if you work on question-answering systems, need better attribution in your RAG pipeline, or want to enable multi-hop reasoning capabilities.

The unified architecture opens new possibilities for adaptive information retrieval that responds dynamically to generation context rather than following fixed retrieval patterns.