RICHES: A Novel Approach to Retrieval-Augmented Generation Through Unified Sequence Generation

Traditional RAG systems split retrieval and generation into separate components, creating complexity and limiting flexibility. RICHES eliminates this separation by interleaving document retrieval directly within sequence generation, enabling more sophisticated question-answering through a single model.

The Problem with Traditional RAG

Current RAG systems require two distinct components: a retriever that finds relevant documents and a generator that produces answers. This architecture creates several limitations:

  • Fixed pipeline: You cannot adapt the retrieval strategy based on partial generation
  • Single-hop limitation: Most systems retrieve documents once, before generation begins
  • Complex integration: Maintaining separate retriever and generator models increases system complexity

How RICHES Works

RICHES unifies retrieval and generation by teaching language models to decode document contents directly from a corpus during text generation. The model learns to:

  1. Generate retrieval tokens that specify which documents to access
  2. Decode document contents constrained to the available corpus
  3. Continue generation using the retrieved information
  4. Plan next retrievals based on what it has generated so far

This approach enables multi-hop retrievals where the model can retrieve additional documents based on insights from previously retrieved content.
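The interleaved loop above can be sketched in a few lines of Python. Everything here is a toy: `model_step` stands in for one call to the language model, the `<retrieve>` marker is a hypothetical special token, and the word-overlap score is a stand-in for the corpus-constrained decoding RICHES actually performs.

```python
RETRIEVE = "<retrieve>"  # hypothetical marker that opens a retrieval turn

def overlap(query, passage):
    # Toy relevance score (shared lowercase words); a stand-in for
    # decoding the passage directly under corpus constraints.
    return len(set(query.lower().split()) & set(passage.lower().split()))

def generate_with_retrieval(model_step, corpus, question, max_hops=3):
    """Alternate free-form generation with retrieval from `corpus`.

    `model_step` takes the context so far and returns the next chunk
    of generated text.
    """
    context = question
    evidence = []
    for _ in range(max_hops):
        chunk = model_step(context)
        context += chunk
        if RETRIEVE not in chunk:
            break  # the model chose to finish its answer
        # Text after the marker acts as the retrieval request; splice
        # the best-matching passage back into the context.
        query = chunk.split(RETRIEVE)[-1].strip()
        passage = max(corpus, key=lambda p: overlap(query, p))
        evidence.append(passage)
        context += " " + passage
    return context, evidence
```

Because each retrieved passage is appended to the context before the next model step, a later retrieval request can build on what an earlier one returned, which is exactly the multi-hop behavior described above.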

Key Advantages

Single Model Architecture: RICHES works with any instruction-tuned language model without additional training or separate retriever components.

Adaptive Retrieval: The model decides when and what to retrieve based on the generation context, enabling more sophisticated reasoning patterns.

Attribution Support: Since the model explicitly retrieves documents during generation, it naturally provides evidence attribution for its answers.

Multi-hop Capability: The model can perform multiple retrieval steps, using information from earlier retrievals to guide later ones.

Implementation Approach

RICHES operates through constrained decoding where the language model generates special tokens that trigger document retrieval. When the model needs information, it:

  1. Generates a retrieval request token
  2. Decodes the relevant document content from the corpus
  3. Uses this information to continue generating the answer
  4. Repeats as needed for complex questions

The constraint mechanism ensures the model can only decode verbatim passages from the available corpus, so retrieved evidence is real text rather than a hallucinated quotation.
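One simple way to picture this constraint is a trie built over the corpus: at each step, the decoder may only emit a symbol that keeps its output a prefix of some real passage. The character-level sketch below is an illustration with hypothetical helper names; the actual mechanism operates on model tokens.

```python
def build_trie(passages):
    """Build a character-level trie over the corpus passages."""
    root = {}
    for passage in passages:
        node = root
        for ch in passage:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-passage marker
    return root

def allowed_next(trie, prefix):
    """Symbols the decoder is permitted to emit after `prefix`."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return set()  # prefix left the corpus: no legal continuation
        node = node[ch]
    return {k for k in node if k != "$"}
```

In a real decoder, `allowed_next` would be used to mask the model's logits so that only legal continuations can be sampled, which is what forces the decoded span to be an exact corpus string.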

Performance Results

Testing on open-domain question-answering tasks shows RICHES performs competitively with traditional RAG systems while offering greater flexibility. The unified approach particularly excels at:

  • Multi-hop questions requiring information from multiple sources
  • Attributed QA where evidence sources must be cited
  • Complex reasoning tasks benefiting from iterative retrieval

Getting Started

To implement RICHES, you need:

  1. An instruction-tuned language model
  2. A document corpus with indexing
  3. Constrained decoding implementation
  4. Prompts that teach the model when to retrieve

The approach requires no additional model training, making it accessible for teams already using instruction-tuned models.
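As a starting point for item 4, a prompt might demonstrate the retrieve-then-answer pattern with a worked example. The `<retrieve>` marker syntax below is an assumption for illustration, not a documented format.

```python
# A minimal few-shot prompt sketch: one demonstration shows the model
# when to open a retrieval span before committing to an answer.
FEW_SHOT = """\
Question: Who wrote the novel that inspired the film Blade Runner?
<retrieve> Do Androids Dream of Electric Sheep? is a novel by Philip K. Dick. </retrieve>
Answer: Philip K. Dick

Question: {question}
"""

def build_prompt(question):
    """Fill the user's question into the few-shot template."""
    return FEW_SHOT.format(question=question)
```

At inference time, the text the model generates inside the retrieval span is the part that gets constrained to the corpus; everything else is decoded freely.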

Next Steps

RICHES represents a significant shift toward unified retrieval-generation architectures. Consider experimenting with this approach if you work on question-answering systems, need better attribution in your RAG pipeline, or want to enable multi-hop reasoning capabilities.

The unified architecture opens new possibilities for adaptive information retrieval that responds dynamically to generation context rather than following fixed retrieval patterns.