In Part 1 of our series, we learned about the basic concepts of Retrieval-Augmented Generation (RAG) and saw how this framework works much like a digital library. We examined the three main components - Retriever, Ranker, and Generator - in detail and understood how they work together to generate precise and contextually relevant answers.
In this second part, we delve deeper into the technical aspects of RAG. We will look at how RAG is implemented in practice, what different model types exist, and how RAG-enhanced systems differ from traditional Large Language Models (LLMs).
The Two RAG Model Types: Sequence and Token
There are two fundamental approaches when implementing RAG: RAG-Sequence and RAG-Token. Each of these approaches has its specific strengths and is suitable for different use cases.
RAG-Sequence: The Holistic Approach
RAG-Sequence works like a careful author who first studies all relevant sources and then writes a coherent text. The mathematical elegance of this approach lies in its holistic perspective:
Expressed as a simplified formula: p(answer | question) = sum over all documents( p(document | question) × p(answer | question, document) )
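For readers who prefer the exact notation, the original RAG paper (Lewis et al., 2020) writes this marginalization roughly as follows, where x is the question, y the answer with tokens y_i, z a retrieved document, p_η the retriever, and p_θ the generator:

$$
p_{\text{RAG-Sequence}}(y \mid x) \;\approx\; \sum_{z \,\in\, \text{top-}k(p(\cdot \mid x))} p_\eta(z \mid x)\, \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})
$$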
In simple terms, this means:
- The system calculates two values for each potentially relevant document:
  - How likely is it that this document is relevant to the question?
  - How likely is it that this document leads to the correct answer?
- These probabilities are multiplied and then summed across all documents to find the best answer.
The technical process proceeds as follows:
- The system selects the top-K most relevant documents for the query from an index of millions of documents
- The Generator creates a complete potential answer for each of these documents
- These candidate answers are then scored and combined according to their calculated probabilities, and the best one is selected (see the short sketch below)
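The following minimal sketch illustrates this marginalization with made-up numbers; the probabilities are hypothetical placeholders, not the output of a real retriever or generator:

```python
import numpy as np

# Hypothetical scores for one question, one candidate answer, and the top-3 documents:
# doc_probs[k]    = p(document_k | question)                    (from the Retriever)
# answer_probs[k] = p(candidate answer | question, document_k)  (from the Generator)
doc_probs = np.array([0.5, 0.3, 0.2])
answer_probs = np.array([0.8, 0.6, 0.1])

# RAG-Sequence marginalizes over the documents:
# p(answer | question) = sum_k p(doc_k | question) * p(answer | question, doc_k)
p_answer = float(np.sum(doc_probs * answer_probs))
print(f"p(answer | question) = {p_answer:.2f}")  # 0.5*0.8 + 0.3*0.6 + 0.2*0.1 = 0.60
```

Each candidate answer receives such a score, and the one with the highest marginal probability is chosen.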
Research shows that this approach is particularly effective at producing coherent, well-structured answers, and that answer quality keeps improving as more relevant documents are included. The resulting answers are typically more diverse and nuanced than with other approaches, which makes RAG-Sequence well suited to tasks that require consistent, coherent output.
This approach is particularly suitable for:
- Summarizing longer texts
- Creating reports
- Answering complex questions that require comprehensive understanding
- Tasks requiring thematic consistency throughout the entire text
RAG-Token: The Granular Approach
RAG-Token works like a conscientious journalist who consults the best possible source for every single sentence and word of their story. The mathematical formulation of this approach reflects this thoroughness:
p(answer | question) = product for each word( sum over all documents( p(document | question) × p(word | question, document, previous words) ) )
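Using the same notation as above, the original paper writes this roughly as:

$$
p_{\text{RAG-Token}}(y \mid x) \;\approx\; \prod_{i=1}^{N} \;\sum_{z \,\in\, \text{top-}k(p(\cdot \mid x))} p_\eta(z \mid x)\; p_\theta(y_i \mid x, z, y_{1:i-1})
$$

The sum over documents now sits inside the product over tokens, so the marginalization happens anew for every generated word.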
In practice, this means the following (a short numeric sketch follows this list):
- For each individual word of the answer:
  - The most relevant documents are re-evaluated
  - The probability of each possible next word is calculated
  - The most probable word is chosen based on all available information
- This process repeats word by word, where each new word is influenced by:
  - The words written so far
  - The original question
  - The currently most relevant documents
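Here is a minimal numeric sketch of a single RAG-Token decoding step; all numbers and the toy vocabulary are hypothetical placeholders:

```python
import numpy as np

# Hypothetical retriever scores for two documents and a toy set of candidate next words.
doc_probs = np.array([0.6, 0.4])        # p(document_k | question)
vocab = ["Paris", "Berlin", "Rome"]     # candidate next words (illustration only)

# p(word | question, document_k, previous words): one row per document, one column per word.
token_probs = np.array([
    [0.80, 0.15, 0.05],                 # distribution suggested by document 1
    [0.60, 0.30, 0.10],                 # distribution suggested by document 2
])

# RAG-Token marginalizes over the documents separately for EVERY generated word:
# p(word | ...) = sum_k p(doc_k | question) * p(word | question, doc_k, previous words)
marginal = doc_probs @ token_probs      # -> [0.72, 0.21, 0.07]
next_word = vocab[int(np.argmax(marginal))]
print(next_word)                        # "Paris"

# A full decoder repeats this step word by word, each time conditioned on the words so far.
```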
Research has shown that this approach delivers particularly precise results where factual accuracy matters. Interestingly, there is an optimal "sweet spot" in the number of retrieved documents: around 10 documents typically deliver the best results, as retrieving more does not necessarily improve the answers but does noticeably increase processing time and cost.
This approach is optimal for:
- Detailed technical explanations
- Fact-based answers
- Situations requiring highly specific information
- Cases where different parts of the answer require different sources
The Technical Implementation: The RAG Pipeline
The practical implementation of a RAG system occurs in three main phases: Ingestion, Retrieval, and Generation. Each of these phases plays a crucial role in the quality of the final results.
1. The Ingestion Phase
In this first phase, documents are fed into the system and prepared for later use. The original RAG implementation used a Wikipedia dump split into 100-word chunks, resulting in roughly 21 million passages. The process proceeds as follows:
- Document Preparation:
  - Documents are split into smaller, manageable pieces (chunks)
  - The chunk size is optimized based on the use case
- Embedding Generation:
  - Each chunk is converted into a multidimensional vector (typically up to 1024 dimensions) by a BERT-based document encoder
  - These vectors capture the semantics of the text - not the exact wording, but the meaning (we'll learn more about vectors in a future article about vector databases)
- Indexing:
  - The embeddings are stored in a vector database
  - A MIPS index (Maximum Inner Product Search) is created for efficient search
  - Technologies like FAISS enable fast, approximate search in sublinear time
This process is crucial for the system's later efficiency. The art lies in splitting the documents so that coherent information isn't torn apart, while keeping the chunks small enough to deliver precise results.
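As an illustration, a minimal ingestion pipeline could look roughly like the sketch below. It assumes the packages sentence-transformers and faiss-cpu are installed; the model name, chunk size, and placeholder corpus are illustrative choices, not fixed requirements:

```python
import faiss
from sentence_transformers import SentenceTransformer

def chunk(text: str, words_per_chunk: int = 100) -> list[str]:
    """Split a document into roughly 100-word chunks, as in the original RAG setup."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk]) for i in range(0, len(words), words_per_chunk)]

documents = ["...long document text...", "...another document..."]  # placeholder corpus
chunks = [piece for doc in documents for piece in chunk(doc)]

# Convert every chunk into a dense vector; the embedding dimension depends on the chosen model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(chunks, normalize_embeddings=True).astype("float32")

# Build an inner-product (MIPS) index. FAISS also offers approximate variants (e.g. IVF, HNSW)
# that scale to millions of vectors with sublinear search time.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
```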
2. The Retrieval Phase
When a query arrives, the retrieval phase begins:
- Query Processing: The query is converted into a vector by the Retriever; at the same time, it is passed along in its original form to the Generator for later answer synthesis
- Similarity Search: The system searches the vector database (the index) by comparing the query vector with the vectors of the indexed documents
- Top-K Selection: The most relevant documents are selected
The number of selected documents (Top-K) is an important parameter that influences the balance between completeness and processing efficiency.
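Continuing the ingestion sketch above (encoder, index, and chunks are assumed to exist from that snippet), the retrieval step could look like this:

```python
# Encode the query with the same model used for the documents.
query = "What are the legal guidelines for project xyz?"
query_vec = encoder.encode([query], normalize_embeddings=True).astype("float32")

# Top-K selection: fetch the K most similar chunks from the index.
top_k = min(5, index.ntotal)
scores, ids = index.search(query_vec, top_k)
retrieved = [chunks[i] for i in ids[0]]   # the most similar chunks, best match first
```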
3. The Generation Phase
In the final phase, the retrieved information is processed into an answer (a short sketch follows this list):
- Context Integration: The selected Top-K documents are combined with the original query
- Answer Generation: The LLM creates a coherent answer
- Quality Assurance: The answer is checked for consistency and relevance before being output
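A minimal sketch of this phase might look as follows; query and retrieved come from the retrieval sketch above, build_augmented_prompt is a hypothetical helper, and call_llm stands in for whatever LLM API is actually used:

```python
def build_augmented_prompt(query: str, retrieved: list[str]) -> str:
    """Combine the retrieved chunks (context) with the original query into one prompt."""
    context = "\n\n".join(f"[Source {i + 1}]\n{chunk}" for i, chunk in enumerate(retrieved))
    return (
        "Answer the question using only the sources below and cite the sources you rely on.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )

augmented_prompt = build_augmented_prompt(query, retrieved)
# answer = call_llm(augmented_prompt)   # plug in your LLM of choice here
```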
A Comprehensible Example
Step 1: Creating a Prompt (Query)
Imagine an employee approaching your desk and asking you a question:
"We had a project idea. We want to do xyz. What are the legal guidelines for this?"
This inquiry is a prompt.
Now put yourself in the position of the artificial intelligence. The question is clear and direct. But to answer it, you need more context. So you search your memory of previous conversations with this employee to establish a better reference to the topic, which leads to a better and more comprehensive understanding of the question.
All the information found is the context. The prompt is now expanded with the context - the historical information is, so to speak, attached to the employee's actual question. The prompt now becomes what is known as an Enhanced Prompt.
As a supervisor, you also consider how exactly the project idea benefits your company. So you think about it some more, perhaps ask follow-up questions, and supplement the Enhanced Prompt with the insights you gain. Step by step, the query is completed in this way.
Step 2: Embedding
In the next step, you feed your query into an Embedding Model, which converts it into a 1024-dimensional vector - not just three dimensions, but a number of dimensions that is genuinely hard to picture. These 1024 dimensions make it possible to capture the meaning of the Query fairly accurately - not the individual words, but the meaning. The Embedding Model then hands this vector to the vector database for the similarity search.
The vector database now performs a similarity search against the vector of the Query. Every document whose vector is sufficiently similar is marked as a candidate. At the end of the search, the best results are selected and the corresponding documents, database entries, and so on are referenced.
These found documents and other information are now attached to the Query. Think of it as if you were attaching documents to an already formulated email. This transforms the Enhanced Prompt into an Augmented Prompt.
Step 3: Generating an Answer
This Augmented Prompt now goes together with the original Query to the Generator, more specifically the involved LLM. The LLM now synthesizes an answer from the original query and all the additional information found. This Response then goes back to the employee who is hopefully satisfied with the content.
The RAG Triad: Evaluating Quality
Even though RAG gives an LLM access to the most current information in the information pool and enables very accurate answers, the risk of hallucinations is not eliminated. There are several reasons for this:
- The Retriever simply doesn't gather enough context (quantitatively), or it gathers wrong information (qualitatively).
- Perhaps the answer isn't fully supported by the gathered context but was too heavily influenced by the LLM and the training data.
- A RAG system might gather the relevant information and build a fundamentally correct answer from it, yet that answer might not fit the actual question.
To avoid this, the RAG Triad was developed.
To evaluate the quality of a RAG system, we consider three central aspects:
1. Context Relevance
- How well do the retrieved documents match the query?
- Was the right context found?
- Is the information current and appropriate?
2. Groundedness
- Is the answer actually based on the retrieved documents?
- Are statements supported by the sources?
- Are there hallucinations or unfounded conclusions?
3. Answer Relevance
- Does the generated answer address the original question?
- Is the answer complete and precise?
- Is the depth of information appropriate?
By applying these three aspects as evaluation criteria to the query, the retrieved context, and the generated answer, one can minimize hallucinations and build more reliable, robust RAG-based applications.
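As a rough sketch, the triad can be expressed as three separate checks. The scoring functions below are hypothetical stubs; in practice they are usually implemented with LLM-as-judge prompts or an evaluation framework such as TruLens or RAGAS:

```python
from dataclasses import dataclass

def score_context_relevance(query: str, context: list[str]) -> float:
    """How well do the retrieved chunks match the query? (stub)"""
    return 0.0

def score_groundedness(context: list[str], answer: str) -> float:
    """Is every claim in the answer supported by the retrieved chunks? (stub)"""
    return 0.0

def score_answer_relevance(query: str, answer: str) -> float:
    """Does the answer actually address the original question? (stub)"""
    return 0.0

@dataclass
class TriadScores:
    context_relevance: float
    groundedness: float
    answer_relevance: float

def evaluate_rag(query: str, context: list[str], answer: str) -> TriadScores:
    return TriadScores(
        score_context_relevance(query, context),
        score_groundedness(context, answer),
        score_answer_relevance(query, answer),
    )
```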
LLMs with and without RAG: A Comparison
Traditional LLMs and RAG-enhanced systems fundamentally differ in their way of working:
Traditional LLMs
- Are based exclusively on pre-trained knowledge
- Cannot access new or specific information
- Are more susceptible to hallucinations
- Have a "frozen" state of knowledge
RAG-enhanced Systems
- Combine pre-trained knowledge with external sources
- Can access current and specific information
- Deliver traceable, source-based answers
- Are flexibly expandable
These differences make RAG an ideal solution for enterprise applications where precision, currency, and traceability are crucial.