RAG Explained: How Retrieval-Augmented Generation Works in 2026
TL;DR
- What it is: RAG connects an LLM to a searchable knowledge base so it answers with retrieved facts, not hallucinated memory
- Core pipeline: Chunk documents → embed to vectors → store in vector DB → retrieve on query → pass to LLM as context
- Use RAG when: Knowledge changes frequently, you need citations, corpus is too large for context window
- Best vector DBs: Pinecone (managed), Qdrant (self-hosted), pgvector (PostgreSQL teams)
- Biggest mistake: Poor chunking strategy — it degrades retrieval more than any other factor
Retrieval-Augmented Generation (RAG) is the most widely deployed technique for grounding LLMs in real, up-to-date knowledge. It powers enterprise chatbots, customer support systems, legal research tools, and internal knowledge bases at thousands of companies.
This guide explains exactly how RAG works, when to use it, how to build a production-ready pipeline, and the critical decisions that determine whether your system actually works well.
The Problem RAG Solves
LLMs have two core limitations for real-world knowledge tasks:
Knowledge cutoff
Models are trained on data up to a specific date. They cannot know what happened after their training cutoff — no matter how large or capable they are.
Hallucination on specific facts
LLMs are excellent at language generation but unreliable at precise factual recall. They will confidently fabricate case numbers, statistics, product specs, and internal policy details that weren't prominent in training data.
RAG solves both problems by separating knowledge storage from language generation. The LLM generates language; the vector database stores facts. On every query, the relevant facts are fetched and handed to the model as context.
How RAG Works: The 5-Stage Pipeline
| Stage | What Happens | Key Decisions |
|---|---|---|
| 1. Ingestion | Load source documents (PDFs, web pages, databases, Notion, etc.) | Document loaders, preprocessing (clean HTML, extract tables) |
| 2. Chunking | Split documents into retrieval-sized units | Chunk size, overlap, splitting strategy (fixed vs semantic) |
| 3. Embedding | Convert text chunks into numeric vectors using an embedding model | Embedding model choice (OpenAI, Cohere, open-source) |
| 4. Retrieval | At query time, embed the question, find nearest neighbor chunks | Top-k, similarity threshold, hybrid search (vector + keyword) |
| 5. Generation | Pass retrieved chunks + question to LLM, generate grounded answer | Context window size, system prompt, citation format |
Chunking Strategy: The Highest-Impact Decision
Chunking is the step most developers get wrong. Poor chunking means relevant content gets split at bad boundaries, retrieval misses the right context, and your RAG system underperforms even with a great LLM.
| Strategy | How It Works | Best For | Chunk Size |
|---|---|---|---|
| Fixed-size | Split every N tokens with overlap | Unstructured prose, quick prototypes | 512 tokens, 50 overlap |
| Semantic (paragraph) | Split at paragraph or section breaks | Structured docs, articles, reports | Variable, ~200–800 tokens |
| Recursive | Split by hierarchy (section → paragraph → sentence) | Long documents with nested structure | LangChain default, adaptive |
| Agentic / Small-to-Big | Store small chunks for retrieval, return parent chunks for generation | High-precision retrieval + rich context | 128 retrieve / 1024 generate |
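To make the fixed vs. recursive distinction concrete, here is a minimal sketch using langchain_text_splitters (a sibling package to the LangChain imports used later in this guide). Note that chunk_size counts characters by default, so pass a token-based length_function if you want the token budgets from the table above:
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)

text = open("company_handbook.txt").read()  # any plain-text source document

# Fixed-size: cut purely by length, ignoring document structure
fixed_splitter = CharacterTextSplitter(
    separator=" ",
    chunk_size=512,
    chunk_overlap=50,
)

# Recursive: prefer paragraph breaks, then lines, then sentences, then words
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " "],
    chunk_size=512,
    chunk_overlap=50,
)

print(len(fixed_splitter.split_text(text)), "fixed chunks")
print(len(recursive_splitter.split_text(text)), "recursive chunks")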
Vector Databases Compared
| Database | Type | Hybrid Search | Scale | Best For |
|---|---|---|---|---|
| Pinecone | Managed SaaS | Yes | Billions of vectors | Production, managed, fast setup |
| Qdrant | Open source / managed | Yes | High | Self-hosted, advanced filtering |
| Weaviate | Open source / managed | Strong BM25 + vector | High | Teams needing robust hybrid search |
| Chroma | Open source | Basic | Small–medium | Local dev, prototyping, LangChain default |
| pgvector | PostgreSQL extension | Via PostgreSQL full-text | Medium | Teams already on Postgres, minimize infra |
| Milvus | Open source | Yes | Very high (billions) | Massive-scale self-hosted deployments |
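If you land on the pgvector row, the sketch below shows the basic insert-and-search loop from Python, assuming psycopg 3 and the pgvector Python package are installed. The connection string, table name, and 1536-dimension column are illustrative; match the dimension to whatever embedding model you use:
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

# Hypothetical connection string; autocommit keeps the sketch short
conn = psycopg.connect("dbname=rag", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)

# Dimension must match your embedding model's output size
conn.execute(
    "CREATE TABLE IF NOT EXISTS chunks "
    "(id bigserial PRIMARY KEY, content text, embedding vector(1536))"
)

# Store one chunk (placeholder vector shown; use a real embedding in practice)
conn.execute(
    "INSERT INTO chunks (content, embedding) VALUES (%s, %s)",
    ("Parental leave is 16 weeks.", np.zeros(1536)),
)

# Retrieve the 5 nearest chunks by cosine distance (the <=> operator)
query_embedding = np.zeros(1536)  # replace with the embedded user question
rows = conn.execute(
    "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 5",
    (query_embedding,),
).fetchall()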
Building a Basic RAG Pipeline (Python + Claude)
This example uses LangChain, Chroma, OpenAI embeddings, and Claude Sonnet 4.6 to build a simple document Q&A system:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_anthropic import ChatAnthropic
from langchain_openai import OpenAIEmbeddings
# 1. Load document
loader = PyPDFLoader("company_handbook.pdf")
docs = loader.load()
# 2. Chunk with overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)
chunks = splitter.split_documents(docs)
# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(chunks, embeddings)
# 4. Retrieve on query
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# 5. Generate with Claude
llm = ChatAnthropic(model="claude-sonnet-4-6")
def rag_query(question: str) -> str:
    # Retrieve relevant chunks
    relevant_docs = retriever.invoke(question)
    context = "\n\n".join([doc.page_content for doc in relevant_docs])
    # Generate grounded answer
    prompt = f"""Answer the question using ONLY the provided context.
If the answer is not in the context, say "I don't have that information."
Context:
{context}
Question: {question}"""
    response = llm.invoke(prompt)
    return response.content

# Test
answer = rag_query("What is our parental leave policy?")
print(answer)

Advanced RAG Techniques in 2026
Hybrid Search (Vector + BM25)
Combine semantic similarity search with keyword-based BM25 search. Hybrid retrieval consistently outperforms pure vector search — keyword search catches exact term matches that embeddings sometimes miss (model names, product codes, proper nouns).
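A minimal hybrid-retrieval sketch with LangChain's BM25Retriever and EnsembleRetriever, reusing the chunks and vectorstore objects from the pipeline example above. The ensemble weights are an arbitrary starting point rather than tuned values, and BM25Retriever requires the rank_bm25 package:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword (BM25) retriever over the same chunks used for the vector index
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Semantic retriever from the Chroma store built earlier
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Blend the two result lists; tune the weights on your own evaluation set
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6],
)

results = hybrid_retriever.invoke("What is the part number for the X200 filter?")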
Reranking
After retrieving top-k chunks, pass them through a cross-encoder reranker (Cohere Rerank, BGE Reranker) to reorder by true relevance. Adds 50–200ms latency but significantly improves answer quality for ambiguous queries.
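Here is a rough reranking sketch using the open BGE cross-encoder via sentence-transformers; Cohere Rerank works the same way conceptually (score each query–chunk pair jointly, then keep only the best few before generation):
from sentence_transformers import CrossEncoder

# Cross-encoder reranker; BAAI/bge-reranker-base is one commonly used open model
reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(question: str, docs: list[str], top_n: int = 3) -> list[str]:
    # Score each (question, chunk) pair jointly, then keep the highest-scoring chunks
    scores = reranker.predict([(question, doc) for doc in docs])
    ranked = sorted(zip(scores, docs), reverse=True, key=lambda pair: pair[0])
    return [doc for _, doc in ranked[:top_n]]

A typical flow retrieves a generous top-k (say 20) from the vector store, reranks, and passes only the top 3–5 chunks to the LLM.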
Query Expansion / HyDE
Hypothetical Document Embeddings (HyDE): ask the LLM to generate a hypothetical answer to the query, then embed that to find matching documents. Effective when user queries are short and imprecise but documents are long and detailed.
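A minimal HyDE sketch, reusing the llm and vectorstore objects from the pipeline example above: generate a hypothetical answer first, then search with that text instead of the raw question:
def hyde_retrieve(question: str, k: int = 5):
    # Ask the model to draft a plausible (possibly wrong) answer
    hypothetical = llm.invoke(
        f"Write a short paragraph that plausibly answers: {question}"
    ).content
    # Search with the hypothetical answer, which is closer in style to the documents
    return vectorstore.similarity_search(hypothetical, k=k)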
Agentic RAG
Give the LLM the ability to decide whether to retrieve, what to retrieve, and when to retrieve again. Instead of a fixed single-retrieval pipeline, the agent can issue multiple search queries, synthesize across sources, and request clarification. Claude Code and LangGraph support agentic RAG workflows.
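A toy version of that control loop, built on the llm, vectorstore, and rag_query pieces from the earlier example. Production agentic RAG would use proper tool calling rather than parsing SEARCH:/ANSWER: prefixes, so treat this purely as an illustration of the idea:
def agentic_rag(question: str, max_steps: int = 3) -> str:
    # The model chooses, at each step, between searching again and answering
    notes: list[str] = []
    for _ in range(max_steps):
        scratchpad = "\n\n".join(notes) if notes else "(nothing retrieved yet)"
        decision = llm.invoke(
            "You answer questions using a document search tool.\n"
            f"Question: {question}\n"
            f"Retrieved so far:\n{scratchpad}\n\n"
            'Reply with "SEARCH: <query>" to retrieve more, '
            'or "ANSWER: <final answer>" once you have enough context.'
        ).content.strip()
        if decision.startswith("SEARCH:"):
            query = decision.removeprefix("SEARCH:").strip()
            notes.extend(
                d.page_content for d in vectorstore.similarity_search(query, k=3)
            )
        else:
            return decision.removeprefix("ANSWER:").strip()
    # Out of steps: fall back to a single-shot RAG answer
    return rag_query(question)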
GraphRAG
Microsoft GraphRAG builds a knowledge graph from documents, enabling retrieval that understands entity relationships rather than just text similarity. Best for complex enterprise knowledge bases where documents reference shared entities (people, projects, policies).
RAG vs. Fine-Tuning vs. Long Context: When to Use Each
| Approach | Best When | Cost | Freshness | Citations |
|---|---|---|---|---|
| RAG | Large corpus, changing data, citations needed | Medium (infra) | Real-time | Yes — source docs |
| Fine-tuning | Specific tone/format, static knowledge, latency-critical | High (training) | Training cutoff | No — baked in |
| Long context | Small corpus (<500K tokens), latency not critical | High (tokens) | Manual update | Partial |
| RAG + Fine-tune | Production systems needing both quality and freshness | Highest | Real-time | Yes |
Common RAG Failure Modes and Fixes
| Failure Mode | Symptom | Fix |
|---|---|---|
| Wrong chunks retrieved | Answers are plausible but miss the actual relevant content | Improve chunking, add hybrid search, use reranking |
| Context window overflow | Retrieved chunks don't fit, LLM truncates | Reduce chunk size, reduce top-k, use map-reduce summarization |
| Model ignores retrieved context | LLM answers from training memory, ignores provided docs | Stronger system prompt ("ONLY use the provided context"), explicitly instruct to cite |
| Stale embeddings | Updated documents not reflected in retrieval | Implement document versioning, re-embed on update, use timestamps as metadata filter |
| Semantic search misses exact terms | Fails on product codes, model numbers, names | Add BM25 hybrid search (vector handles semantics, BM25 handles exact terms) |
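As one concrete version of the stale-embeddings fix, the sketch below tags each chunk with an ingestion timestamp and filters on it at query time, using the Chroma objects from the earlier example. The metadata key and filter syntax here are illustrative and vary by vector store:
import time

# Tag each chunk with an epoch timestamp when it is (re-)embedded
for chunk in chunks:
    chunk.metadata["embedded_at"] = int(time.time())
vectorstore = Chroma.from_documents(chunks, embeddings)

# At query time, ignore anything embedded before a cutoff (e.g. the last 30 days)
cutoff = int(time.time()) - 30 * 24 * 3600
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5, "filter": {"embedded_at": {"$gte": cutoff}}}
)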
Build RAG Applications with HappyCapy
HappyCapy gives you access to Claude Sonnet 4.6 for building document Q&A systems, research tools, and knowledge base chatbots. Start with a free trial.
Try HappyCapy Free
Frequently Asked Questions
What is RAG (Retrieval-Augmented Generation)?
RAG combines an LLM with an external knowledge retrieval system. It first searches a vector database for relevant documents, then passes those as context to the LLM to generate grounded answers with citations. It prevents hallucination by giving the model facts to reference rather than relying on parametric memory.
When should I use RAG vs. fine-tuning?
Use RAG when your knowledge changes frequently, you need source citations, or your corpus is too large for a context window. Use fine-tuning for static knowledge, specific tone/format, or latency-critical applications. Most production systems use both: RAG for knowledge, fine-tuning for style.
What vector database should I use for RAG?
Pinecone for managed production deployments. Qdrant for self-hosted with advanced filtering. Chroma for local development. pgvector if you're already on PostgreSQL. Weaviate for strong hybrid search. The best choice depends on your existing infrastructure and scale requirements.
What chunk size should I use for RAG?
Start with 512 tokens with 50-token overlap. Use smaller chunks (128–256 tokens) for precise factual retrieval. Use larger chunks (1024+ tokens) when context continuity matters. Always test empirically — chunk size has the highest impact on RAG quality of any single parameter.
Sources: Anthropic documentation, LangChain RAG documentation, Microsoft GraphRAG paper, Pinecone RAG guide, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020).