RAG Explained: How Retrieval-Augmented Generation Works in 2026
TL;DR
- What it is: RAG connects an LLM to a searchable knowledge base so it answers with retrieved facts, not hallucinated memory
- Core pipeline: Chunk documents → embed to vectors → store in vector DB → retrieve on query → pass to LLM as context
- Use RAG when: Knowledge changes frequently, you need citations, corpus is too large for context window
- Best vector DBs: Pinecone (managed), Qdrant (self-hosted), pgvector (PostgreSQL teams)
- Biggest mistake: Poor chunking strategy — it degrades retrieval more than any other factor
Retrieval-Augmented Generation (RAG) is the most widely deployed technique for grounding LLMs in real, up-to-date knowledge. It powers enterprise chatbots, customer support systems, legal research tools, and internal knowledge bases at thousands of companies.
This guide explains exactly how RAG works, when to use it, how to build a production-ready pipeline, and the critical decisions that determine whether your system actually works well.
The Problem RAG Solves
LLMs have two core limitations for real-world knowledge tasks:
Knowledge cutoff
Models are trained on data up to a specific date. They cannot know what happened after their training cutoff — no matter how large or capable they are.
Hallucination on specific facts
LLMs are excellent at language generation but unreliable at precise factual recall. They will confidently fabricate case numbers, statistics, product specs, and internal policy details that weren't prominent in training data.
RAG solves both problems by separating knowledge storage from language generation. The LLM generates language; the vector database stores facts. On every query, the relevant facts are fetched and handed to the model as context.
How RAG Works: The 5-Stage Pipeline
| Stage | What Happens | Key Decisions |
|---|---|---|
| 1. Ingestion | Load source documents (PDFs, web pages, databases, Notion, etc.) | Document loaders, preprocessing (clean HTML, extract tables) |
| 2. Chunking | Split documents into retrieval-sized units | Chunk size, overlap, splitting strategy (fixed vs semantic) |
| 3. Embedding | Convert text chunks into numeric vectors using an embedding model | Embedding model choice (OpenAI, Cohere, open-source) |
| 4. Retrieval | At query time, embed the question, find nearest neighbor chunks | Top-k, similarity threshold, hybrid search (vector + keyword) |
| 5. Generation | Pass retrieved chunks + question to LLM, generate grounded answer | Context window size, system prompt, citation format |
Chunking Strategy: The Highest-Impact Decision
Chunking is the step most developers get wrong. Poor chunking means relevant content gets split at bad boundaries, retrieval misses the right context, and your RAG system underperforms even with a great LLM.
| Strategy | How It Works | Best For | Chunk Size |
|---|---|---|---|
| Fixed-size | Split every N tokens with overlap | Unstructured prose, quick prototypes | 512 tokens, 50 overlap |
| Semantic (paragraph) | Split at paragraph or section breaks | Structured docs, articles, reports | Variable, ~200–800 tokens |
| Recursive | Split by hierarchy (section → paragraph → sentence) | Long documents with nested structure | LangChain default, adaptive |
| Agentic / Small-to-Big | Store small chunks for retrieval, return parent chunks for generation | High-precision retrieval + rich context | 128 retrieve / 1024 generate |
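To make the fixed vs. recursive distinction concrete, here is a minimal sketch using langchain_text_splitters (a sibling package to the LangChain imports used later in this guide). Note that chunk_size counts characters by default, so pass a token-based length_function if you want the token budgets from the table above:
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)

text = open("company_handbook.txt").read()  # any plain-text source document

# Fixed-size: cut purely by length, ignoring document structure
fixed_splitter = CharacterTextSplitter(
    separator=" ",
    chunk_size=512,
    chunk_overlap=50,
)

# Recursive: prefer paragraph breaks, then lines, then sentences, then words
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " "],
    chunk_size=512,
    chunk_overlap=50,
)

print(len(fixed_splitter.split_text(text)), "fixed chunks")
print(len(recursive_splitter.split_text(text)), "recursive chunks")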
Vector Databases Compared
| Database | Type | Hybrid Search | Scale | Best For |
|---|---|---|---|---|
| Pinecone | Managed SaaS | Yes | Billions of vectors | Production, managed, fast setup |
| Qdrant | Open source / managed | Yes | High | Self-hosted, advanced filtering |
| Weaviate | Open source / managed | Strong BM25 + vector | High | Teams needing robust hybrid search |
| Chroma | Open source | Basic | Small–medium | Local dev, prototyping, LangChain default |
| pgvector | PostgreSQL extension | Via PostgreSQL full-text | Medium | Teams already on Postgres, minimize infra |
| Milvus | Open source | Yes | Very high (billions) | Massive-scale self-hosted deployments |
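If you land on the pgvector row, the sketch below shows the basic insert-and-search loop from Python, assuming psycopg 3 and the pgvector Python package are installed. The connection string, table name, and 1536-dimension column are illustrative; match the dimension to whatever embedding model you use:
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

# Hypothetical connection string; autocommit keeps the sketch short
conn = psycopg.connect("dbname=rag", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)

# Dimension must match your embedding model's output size
conn.execute(
    "CREATE TABLE IF NOT EXISTS chunks "
    "(id bigserial PRIMARY KEY, content text, embedding vector(1536))"
)

# Store one chunk (placeholder vector shown; use a real embedding in practice)
conn.execute(
    "INSERT INTO chunks (content, embedding) VALUES (%s, %s)",
    ("Parental leave is 16 weeks.", np.zeros(1536)),
)

# Retrieve the 5 nearest chunks by cosine distance (the <=> operator)
query_embedding = np.zeros(1536)  # replace with the embedded user question
rows = conn.execute(
    "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 5",
    (query_embedding,),
).fetchall()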
Building a Basic RAG Pipeline (Python + Claude)
This example uses LangChain, Chroma, OpenAI embeddings, and Claude Sonnet 4.6 to build a simple document Q&A system:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_anthropic import ChatAnthropic
from langchain_openai import OpenAIEmbeddings
# 1. Load document
loader = PyPDFLoader("company_handbook.pdf")
docs = loader.load()
# 2. Chunk with overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)
chunks = splitter.split_documents(docs)
# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(chunks, embeddings)
# 4. Retrieve on query
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# 5. Generate with Claude
llm = ChatAnthropic(model="claude-sonnet-4-6")
def rag_query(question: str) -> str:
    # Retrieve relevant chunks
    relevant_docs = retriever.invoke(question)
    context = "\n\n".join([doc.page_content for doc in relevant_docs])
    # Generate grounded answer
    prompt = f"""Answer the question using ONLY the provided context.
If the answer is not in the context, say "I don't have that information."
Context:
{context}
Question: {question}"""
    response = llm.invoke(prompt)
    return response.content

# Test
answer = rag_query("What is our parental leave policy?")
print(answer)

Advanced RAG Techniques in 2026
Hybrid Search (Vector + BM25)
Combine semantic similarity search with keyword-based BM25 search. Hybrid retrieval consistently outperforms pure vector search — keyword search catches exact term matches that embeddings sometimes miss (model names, product codes, proper nouns).
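A minimal hybrid-retrieval sketch with LangChain's BM25Retriever and EnsembleRetriever, reusing the chunks and vectorstore objects from the pipeline example above. The ensemble weights are an arbitrary starting point rather than tuned values, and BM25Retriever requires the rank_bm25 package:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword (BM25) retriever over the same chunks used for the vector index
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Semantic retriever from the Chroma store built earlier
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Blend the two result lists; tune the weights on your own evaluation set
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6],
)

results = hybrid_retriever.invoke("What is the part number for the X200 filter?")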
Reranking
After retrieving top-k chunks, pass them through a cross-encoder reranker (Cohere Rerank, BGE Reranker) to reorder by true relevance. Adds 50–200ms latency but significantly improves answer quality for ambiguous queries.
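Here is a rough reranking sketch using the open BGE cross-encoder via sentence-transformers; Cohere Rerank works the same way conceptually (score each query–chunk pair jointly, then keep only the best few before generation):
from sentence_transformers import CrossEncoder

# Cross-encoder reranker; BAAI/bge-reranker-base is one commonly used open model
reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(question: str, docs: list[str], top_n: int = 3) -> list[str]:
    # Score each (question, chunk) pair jointly, then keep the highest-scoring chunks
    scores = reranker.predict([(question, doc) for doc in docs])
    ranked = sorted(zip(scores, docs), reverse=True, key=lambda pair: pair[0])
    return [doc for _, doc in ranked[:top_n]]

A typical flow retrieves a generous top-k (say 20) from the vector store, reranks, and passes only the top 3–5 chunks to the LLM.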
Query Expansion / HyDE
Hypothetical Document Embeddings (HyDE): ask the LLM to generate a hypothetical answer to the query, then embed that to find matching documents. Effective when user queries are short and imprecise but documents are long and detailed.
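A minimal HyDE sketch, reusing the llm and vectorstore objects from the pipeline example above: generate a hypothetical answer first, then search with that text instead of the raw question:
def hyde_retrieve(question: str, k: int = 5):
    # Ask the model to draft a plausible (possibly wrong) answer
    hypothetical = llm.invoke(
        f"Write a short paragraph that plausibly answers: {question}"
    ).content
    # Search with the hypothetical answer, which is closer in style to the documents
    return vectorstore.similarity_search(hypothetical, k=k)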
Agentic RAG
Give the LLM the ability to decide whether to retrieve, what to retrieve, and when to retrieve again. Instead of a fixed single-retrieval pipeline, the agent can issue multiple search queries, synthesize across sources, and request clarification. Claude Code and LangGraph support agentic RAG workflows.
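A toy version of that control loop, built on the llm, vectorstore, and rag_query pieces from the earlier example. Production agentic RAG would use proper tool calling rather than parsing SEARCH:/ANSWER: prefixes, so treat this purely as an illustration of the idea:
def agentic_rag(question: str, max_steps: int = 3) -> str:
    # The model chooses, at each step, between searching again and answering
    notes: list[str] = []
    for _ in range(max_steps):
        scratchpad = "\n\n".join(notes) if notes else "(nothing retrieved yet)"
        decision = llm.invoke(
            "You answer questions using a document search tool.\n"
            f"Question: {question}\n"
            f"Retrieved so far:\n{scratchpad}\n\n"
            'Reply with "SEARCH: <query>" to retrieve more, '
            'or "ANSWER: <final answer>" once you have enough context.'
        ).content.strip()
        if decision.startswith("SEARCH:"):
            query = decision.removeprefix("SEARCH:").strip()
            notes.extend(
                d.page_content for d in vectorstore.similarity_search(query, k=3)
            )
        else:
            return decision.removeprefix("ANSWER:").strip()
    # Out of steps: fall back to a single-shot RAG answer
    return rag_query(question)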
GraphRAG
Microsoft GraphRAG builds a knowledge graph from documents, enabling retrieval that understands entity relationships rather than just text similarity. Best for complex enterprise knowledge bases where documents reference shared entities (people, projects, policies).
RAG vs. Fine-Tuning vs. Long Context: When to Use Each
| Approach | Best When | Cost | Freshness | Citations |
|---|---|---|---|---|
| RAG | Large corpus, changing data, citations needed | Medium (infra) | Real-time | Yes — source docs |
| Fine-tuning | Specific tone/format, static knowledge, latency-critical | High (training) | Training cutoff | No — baked in |
| Long context | Small corpus (<500K tokens), latency not critical | High (tokens) | Manual update | Partial |
| RAG + Fine-tune | Production systems needing both quality and freshness | Highest | Real-time | Yes |
Common RAG Failure Modes and Fixes
| Failure Mode | Symptom | Fix |
|---|---|---|
| Wrong chunks retrieved | Answers are plausible but miss the actual relevant content | Improve chunking, add hybrid search, use reranking |
| Context window overflow | Retrieved chunks don't fit, LLM truncates | Reduce chunk size, reduce top-k, use map-reduce summarization |
| Model ignores retrieved context | LLM answers from training memory, ignores provided docs | Stronger system prompt ("ONLY use the provided context"), explicitly instruct to cite |
| Stale embeddings | Updated documents not reflected in retrieval | Implement document versioning, re-embed on update, use timestamps as metadata filter |
| Semantic search misses exact terms | Fails on product codes, model numbers, names | Add BM25 hybrid search (vector handles semantics, BM25 handles exact terms) |
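As one concrete version of the stale-embeddings fix, the sketch below tags each chunk with an ingestion timestamp and filters on it at query time, using the Chroma objects from the earlier example. The metadata key and filter syntax here are illustrative and vary by vector store:
import time

# Tag each chunk with an epoch timestamp when it is (re-)embedded
for chunk in chunks:
    chunk.metadata["embedded_at"] = int(time.time())
vectorstore = Chroma.from_documents(chunks, embeddings)

# At query time, ignore anything embedded before a cutoff (e.g. the last 30 days)
cutoff = int(time.time()) - 30 * 24 * 3600
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5, "filter": {"embedded_at": {"$gte": cutoff}}}
)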
Build RAG Applications with HappyCapy
HappyCapy gives you access to Claude Sonnet 4.6 for building document Q&A systems, research tools, and knowledge base chatbots. Start with a free trial.
Try HappyCapy Free
Frequently Asked Questions
What is RAG (Retrieval-Augmented Generation)?
RAG combines an LLM with an external knowledge retrieval system. It first searches a vector database for relevant documents, then passes those as context to the LLM to generate grounded answers with citations. It prevents hallucination by giving the model facts to reference rather than relying on parametric memory.
When should I use RAG vs. fine-tuning?
Use RAG when your knowledge changes frequently, you need source citations, or your corpus is too large for a context window. Use fine-tuning for static knowledge, specific tone/format, or latency-critical applications. Most production systems use both: RAG for knowledge, fine-tuning for style.
What vector database should I use for RAG?
Pinecone for managed production deployments. Qdrant for self-hosted with advanced filtering. Chroma for local development. pgvector if you're already on PostgreSQL. Weaviate for strong hybrid search. The best choice depends on your existing infrastructure and scale requirements.
What chunk size should I use for RAG?
Start with 512 tokens with 50-token overlap. Use smaller chunks (128–256 tokens) for precise factual retrieval. Use larger chunks (1024+ tokens) when context continuity matters. Always test empirically — chunk size has the highest impact on RAG quality of any single parameter.
Sources: Anthropic documentation, LangChain RAG documentation, Microsoft GraphRAG paper, Pinecone RAG guide, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020).