Retrieval-Augmented Generation
Ground models with your knowledge
The Knowledge Problem
Large language models are impressive, but they have limitations:
- Knowledge cutoff: They only know information present in their training data
- Hallucinations: They confidently make up facts
- No private data: They can't access your documents
- No updates: They can't learn new information after training
Retrieval-Augmented Generation (RAG) solves these problems by giving the model access to external knowledge at query time.
How RAG Works
RAG combines two systems:
- Retriever: Finds relevant documents from a knowledge base
- Generator: Uses those documents to answer questions
When you ask a question:
- The retriever searches for relevant documents
- The most relevant chunks are added to the prompt
- The LLM generates an answer using this context
It's like giving someone a research assistant who finds relevant passages before they answer.
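The retrieve-then-generate loop can be sketched in a few lines. Here the retriever is a toy word-overlap scorer and the final LLM call is left out; a real system would use vector embeddings for retrieval and send the assembled prompt to a model.

```python
# Minimal sketch of the retrieve-then-generate loop.
# The knowledge base and scoring are toy stand-ins; real systems use
# vector embeddings and an actual LLM call for the final answer.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (toy retriever)."""
    query_words = set(query.lower().split())
    return sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Insert the retrieved chunks into the prompt before the question."""
    context = "\n".join(f"[Document {i + 1}]: {c}" for i, c in enumerate(chunks))
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

docs = [
    "RAG combines a retriever with a generator.",
    "Fine-tuning bakes knowledge into model weights.",
    "Vector databases enable fast similarity search.",
]
query = "How does RAG combine retrieval?"
prompt = build_prompt(query, retrieve(query, docs))
# `prompt` would then be sent to the LLM for generation
```

The key point is that the model never sees the whole knowledge base, only the few chunks the retriever judged relevant to this query.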
Building a Knowledge Base
First, you need to prepare your documents:
Chunking: Split documents into smaller pieces (typically 200-500 words). Too small = missing context. Too large = diluted relevance.
Embedding: Convert each chunk into a vector (list of numbers) that captures its meaning. Similar texts have similar vectors.
Indexing: Store vectors in a vector database for fast similarity search.
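A simple word-based chunker illustrates the splitting step. Real pipelines often split on sentence or section boundaries instead, and the overlap between consecutive chunks helps avoid cutting a thought in half; the sizes below are just illustrative defaults.

```python
# Word-based chunking with overlap, as described above.

def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into chunks of ~chunk_size words, overlapping by `overlap`."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the tail
    return chunks

doc = " ".join(f"word{i}" for i in range(700))
chunks = chunk_text(doc)
# 700 words with size 300 / overlap 50 -> chunks starting at words 0, 250, 500
```

Each chunk would then be passed to the embedding model and stored in the index alongside its original text.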
The Retrieval Process
When a query comes in:
- Embed the query: Convert it to a vector using the same embedding model
- Search: Find the k most similar document vectors
- Fetch: Retrieve the original text chunks
Popular vector databases: Pinecone, Weaviate, Chroma, Qdrant, Milvus
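At small scale, the similarity search itself is just a cosine-similarity ranking, which the vector databases above implement with approximate indexes for speed. A NumPy sketch (with random placeholder embeddings standing in for a real embedding model):

```python
# Top-k cosine similarity search over an in-memory index.
# Embeddings here are random placeholders; in practice the index rows
# and the query are produced by the same embedding model.
import numpy as np

def top_k(query_vec: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k most cosine-similar rows of `index`."""
    q = query_vec / np.linalg.norm(query_vec)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    sims = m @ q                       # cosine similarity per chunk
    return np.argsort(sims)[::-1][:k]  # highest similarity first

rng = np.random.default_rng(0)
index = rng.normal(size=(100, 384))              # 100 chunk vectors, dim 384
query = index[42] + 0.01 * rng.normal(size=384)  # query near chunk 42
hits = top_k(query, index)
# chunk 42 ranks first, since the query vector sits right next to it
```

A vector database replaces the brute-force `argsort` with an approximate nearest-neighbor index so the search stays fast at millions of chunks.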
Context Construction
Retrieved chunks are formatted into a prompt:
Use the following context to answer the question.
Context:
[Document 1]: ...
[Document 2]: ...
[Document 3]: ...
Question: {user's question}
Answer:
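One practical wrinkle in context construction is that the retrieved chunks must fit the model's context window. A common approach, sketched here with a crude words-as-tokens budget, is to add chunks in relevance order until the budget runs out:

```python
# Fit retrieved chunks into a context budget, most relevant first.
# Word counts stand in for a real tokenizer here.

def fit_context(chunks: list[str], budget_words: int = 1500) -> list[str]:
    """Keep chunks in relevance order until the word budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())
        if used + n > budget_words:
            break
        kept.append(chunk)
        used += n
    return kept

chunks = [("alpha " * 600).strip(), ("beta " * 600).strip(), ("gamma " * 600).strip()]
kept = fit_context(chunks)
# only the first two 600-word chunks fit within the 1500-word budget
```

Because chunks arrive sorted by relevance, truncating from the end drops the least useful context first.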
The LLM then generates based on both the question and the provided context.
Why RAG Beats Fine-Tuning for Facts
Fine-tuning bakes knowledge into model weights:
- Slow to update
- Can degrade other capabilities
- No clear attribution
RAG provides knowledge at runtime:
- Instant updates (just change the documents)
- Doesn't affect core model behavior
- Can cite sources
Common RAG Patterns
Basic RAG: Query → Retrieve → Generate
Multi-step RAG: Query → Retrieve → Follow-up query → Retrieve more → Generate
Self-RAG: Model decides when and what to retrieve
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, then search for similar real documents
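The HyDE idea can be sketched as follows. Instead of embedding the short, often vague query directly, you embed a hypothetical answer to it and search with that. Both the LLM call and the "embedding" below are toy stand-ins (word sets compared by Jaccard similarity) so the flow stays self-contained:

```python
# HyDE flow: generate a hypothetical answer, then search for real
# documents similar to it. fake_llm and embed are toy stand-ins.

def fake_llm(prompt: str) -> str:
    """Stand-in for an LLM generating a hypothetical answer."""
    return "RAG retrieves relevant documents and adds them to the prompt"

def embed(text: str) -> set[str]:
    """Toy 'embedding': the set of lowercase words."""
    return set(text.lower().split())

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)

def hyde_search(query: str, documents: list[str]) -> str:
    hypothetical = fake_llm(f"Answer briefly: {query}")
    h = embed(hypothetical)
    return max(documents, key=lambda d: jaccard(h, embed(d)))

docs = [
    "RAG adds retrieved documents to the prompt before generation",
    "Fine-tuning changes model weights",
]
best = hyde_search("What does RAG do?", docs)
```

The hypothetical answer is usually closer in wording to real documents than the question is, which is why this often retrieves better than embedding the query itself.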
Challenges and Solutions
Challenge: Irrelevant results
- Solution: Reranking with a cross-encoder
- Solution: Better chunking strategies
Challenge: Missing context
- Solution: Retrieve more chunks
- Solution: Include parent/sibling chunks
Challenge: Conflicting information
- Solution: Date/version filtering
- Solution: Source prioritization
Challenge: Vague user queries
- Solution: Query rewriting
- Solution: Query expansion
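Of the mitigations above, reranking is worth a sketch: a first-stage retriever returns a broad candidate set, then a cross-encoder scores each (query, document) pair jointly and reorders them. The scorer here is a toy overlap function standing in for a real cross-encoder model:

```python
# Rerank a candidate set with a pairwise relevance scorer.
# cross_score is a toy stand-in for a real cross-encoder model.

def cross_score(query: str, doc: str) -> float:
    """Toy pairwise relevance score (stand-in for a cross-encoder)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query: str, candidates: list[str], k: int = 2) -> list[str]:
    """Reorder first-stage candidates by pairwise score, keep the top k."""
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)[:k]

candidates = [
    "unrelated text about cooking pasta",
    "vector databases store embeddings for similarity search",
    "similarity search over embeddings finds relevant chunks",
]
top = rerank("similarity search over embeddings", candidates)
```

Cross-encoders are too slow to score the whole knowledge base, which is why they run only on the small candidate set the fast first-stage retriever produces.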
Use Cases
- Enterprise search: Query internal documents
- Customer support: Answer questions from help docs
- Legal research: Find relevant precedents
- Medical: Ground answers in research papers
- Code assistance: Retrieve from documentation