Retrieval-Augmented Generation
Ground models with your knowledge
The Knowledge Problem
Large language models are impressive, but they have limitations:
- Knowledge cutoff: They only know information present in their training data
- Hallucinations: They confidently make up facts
- No private data: They can't access your documents
- No updates: They can't learn new information after training
Retrieval-Augmented Generation (RAG) solves these problems by giving the model access to external knowledge at query time.
How RAG Works
RAG combines two systems:
- Retriever: Finds relevant documents from a knowledge base
- Generator: Uses those documents to answer questions
When you ask a question:
- The retriever searches for relevant documents
- The most relevant chunks are added to the prompt
- The LLM generates an answer using this context
It's like giving someone a research assistant who finds relevant passages before they answer.
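The retrieve-then-generate loop can be sketched in a few lines. Here the retriever is a toy word-overlap scorer and the final LLM call is left out; a real system would use vector embeddings for retrieval and send the assembled prompt to a model.

```python
# Minimal sketch of the retrieve-then-generate loop.
# The knowledge base and scoring are toy stand-ins; real systems use
# vector embeddings and an actual LLM call for the final answer.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (toy retriever)."""
    query_words = set(query.lower().split())
    return sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Insert the retrieved chunks into the prompt before the question."""
    context = "\n".join(f"[Document {i + 1}]: {c}" for i, c in enumerate(chunks))
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

docs = [
    "RAG combines a retriever with a generator.",
    "Fine-tuning bakes knowledge into model weights.",
    "Vector databases enable fast similarity search.",
]
query = "How does RAG combine retrieval?"
prompt = build_prompt(query, retrieve(query, docs))
# `prompt` would then be sent to the LLM for generation
```

The key point is that the model never sees the whole knowledge base, only the few chunks the retriever judged relevant to this query.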
Building a Knowledge Base
First, you need to prepare your documents:
Chunking: Split documents into smaller pieces (typically 200-500 words). Too small = missing context. Too large = diluted relevance.
Embedding: Convert each chunk into a vector (list of numbers) that captures its meaning. Similar texts have similar vectors.
Indexing: Store vectors in a vector database for fast similarity search.
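A simple word-based chunker illustrates the splitting step. Real pipelines often split on sentence or section boundaries instead, and the overlap between consecutive chunks helps avoid cutting a thought in half; the sizes below are just illustrative defaults.

```python
# Word-based chunking with overlap, as described above.

def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into chunks of ~chunk_size words, overlapping by `overlap`."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the tail
    return chunks

doc = " ".join(f"word{i}" for i in range(700))
chunks = chunk_text(doc)
# 700 words with size 300 / overlap 50 -> chunks starting at words 0, 250, 500
```

Each chunk would then be passed to the embedding model and stored in the index alongside its original text.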
The Retrieval Process
When a query comes in:
- Embed the query: Convert it to a vector using the same embedding model
- Search: Find the k most similar document vectors
- Fetch: Retrieve the original text chunks
Popular vector databases: Pinecone, Weaviate, Chroma, Qdrant, Milvus
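At small scale, the similarity search itself is just a cosine-similarity ranking, which the vector databases above implement with approximate indexes for speed. A NumPy sketch (with random placeholder embeddings standing in for a real embedding model):

```python
# Top-k cosine similarity search over an in-memory index.
# Embeddings here are random placeholders; in practice the index rows
# and the query are produced by the same embedding model.
import numpy as np

def top_k(query_vec: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k most cosine-similar rows of `index`."""
    q = query_vec / np.linalg.norm(query_vec)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    sims = m @ q                       # cosine similarity per chunk
    return np.argsort(sims)[::-1][:k]  # highest similarity first

rng = np.random.default_rng(0)
index = rng.normal(size=(100, 384))              # 100 chunk vectors, dim 384
query = index[42] + 0.01 * rng.normal(size=384)  # query near chunk 42
hits = top_k(query, index)
# chunk 42 ranks first, since the query vector sits right next to it
```

A vector database replaces the brute-force `argsort` with an approximate nearest-neighbor index so the search stays fast at millions of chunks.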
Context Construction
Retrieved chunks are formatted into a prompt:
Use the following context to answer the question.
Context:
[Document 1]: ...
[Document 2]: ...
[Document 3]: ...
Question: {user's question}
Answer:
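One practical wrinkle in context construction is that the retrieved chunks must fit the model's context window. A common approach, sketched here with a crude words-as-tokens budget, is to add chunks in relevance order until the budget runs out:

```python
# Fit retrieved chunks into a context budget, most relevant first.
# Word counts stand in for a real tokenizer here.

def fit_context(chunks: list[str], budget_words: int = 1500) -> list[str]:
    """Keep chunks in relevance order until the word budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())
        if used + n > budget_words:
            break
        kept.append(chunk)
        used += n
    return kept

chunks = [("alpha " * 600).strip(), ("beta " * 600).strip(), ("gamma " * 600).strip()]
kept = fit_context(chunks)
# only the first two 600-word chunks fit within the 1500-word budget
```

Because chunks arrive sorted by relevance, truncating from the end drops the least useful context first.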
The LLM then generates based on both the question and the provided context.
Why RAG Beats Fine-Tuning for Facts
Fine-tuning bakes knowledge into model weights:
- Slow to update
- Can degrade other capabilities
- No clear attribution
RAG provides knowledge at runtime:
- Instant updates (just change the documents)
- Doesn't affect core model behavior
- Can cite sources
Common RAG Patterns
Basic RAG: Query → Retrieve → Generate
Multi-step RAG: Query → Retrieve → Follow-up query → Retrieve more → Generate
Self-RAG: Model decides when and what to retrieve
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, then search for similar real documents
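The HyDE idea can be sketched as follows. Instead of embedding the short, often vague query directly, you embed a hypothetical answer to it and search with that. Both the LLM call and the "embedding" below are toy stand-ins (word sets compared by Jaccard similarity) so the flow stays self-contained:

```python
# HyDE flow: generate a hypothetical answer, then search for real
# documents similar to it. fake_llm and embed are toy stand-ins.

def fake_llm(prompt: str) -> str:
    """Stand-in for an LLM generating a hypothetical answer."""
    return "RAG retrieves relevant documents and adds them to the prompt"

def embed(text: str) -> set[str]:
    """Toy 'embedding': the set of lowercase words."""
    return set(text.lower().split())

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)

def hyde_search(query: str, documents: list[str]) -> str:
    hypothetical = fake_llm(f"Answer briefly: {query}")
    h = embed(hypothetical)
    return max(documents, key=lambda d: jaccard(h, embed(d)))

docs = [
    "RAG adds retrieved documents to the prompt before generation",
    "Fine-tuning changes model weights",
]
best = hyde_search("What does RAG do?", docs)
```

The hypothetical answer is usually closer in wording to real documents than the question is, which is why this often retrieves better than embedding the query itself.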
Challenges and Solutions
Challenge: Irrelevant results
- Solution: Reranking with a cross-encoder
- Solution: Better chunking strategies
Challenge: Missing context
- Solution: Retrieve more chunks
- Solution: Include parent/sibling chunks
Challenge: Conflicting information
- Solution: Date/version filtering
- Solution: Source prioritization
Challenge: Vague user queries
- Solution: Query rewriting
- Solution: Query expansion
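Of the mitigations above, reranking is worth a sketch: a first-stage retriever returns a broad candidate set, then a cross-encoder scores each (query, document) pair jointly and reorders them. The scorer here is a toy overlap function standing in for a real cross-encoder model:

```python
# Rerank a candidate set with a pairwise relevance scorer.
# cross_score is a toy stand-in for a real cross-encoder model.

def cross_score(query: str, doc: str) -> float:
    """Toy pairwise relevance score (stand-in for a cross-encoder)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query: str, candidates: list[str], k: int = 2) -> list[str]:
    """Reorder first-stage candidates by pairwise score, keep the top k."""
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)[:k]

candidates = [
    "unrelated text about cooking pasta",
    "vector databases store embeddings for similarity search",
    "similarity search over embeddings finds relevant chunks",
]
top = rerank("similarity search over embeddings", candidates)
```

Cross-encoders are too slow to score the whole knowledge base, which is why they run only on the small candidate set the fast first-stage retriever produces.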
Use Cases
- Enterprise search: Query internal documents
- Customer support: Answer questions from help docs
- Legal research: Find relevant precedents
- Medical: Ground answers in research papers
- Code assistance: Retrieve from documentation