HappycapyGuide

By Connie · Last reviewed: April 2026 — pricing & tools verified · This article contains affiliate links. We may earn a commission at no extra cost to you if you sign up through our links.

Tutorial · 12 min read · April 5, 2026

RAG Explained: How Retrieval-Augmented Generation Works in 2026

TL;DR

  • What it is: RAG connects an LLM to a searchable knowledge base so it answers with retrieved facts, not hallucinated memory
  • Core pipeline: Chunk documents → embed to vectors → store in vector DB → retrieve on query → pass to LLM as context
  • Use RAG when: Knowledge changes frequently, you need citations, corpus is too large for context window
  • Best vector DBs: Pinecone (managed), Qdrant (self-hosted), pgvector (PostgreSQL teams)
  • Biggest mistake: Poor chunking strategy — it degrades retrieval more than any other factor

Retrieval-Augmented Generation (RAG) is the most widely deployed technique for grounding LLMs in real, up-to-date knowledge. It powers enterprise chatbots, customer support systems, legal research tools, and internal knowledge bases at thousands of companies.

This guide explains exactly how RAG works, when to use it, how to build a production-ready pipeline, and the critical decisions that determine whether your system actually works well.

The Problem RAG Solves

LLMs have two core limitations for real-world knowledge tasks:

Knowledge cutoff

Models are trained on data up to a specific date. They cannot know what happened after their training cutoff — no matter how large or capable they are.

Hallucination on specific facts

LLMs are excellent at language generation but unreliable at precise factual recall. They will confidently fabricate case numbers, statistics, product specs, and internal policy details that weren't prominent in training data.

RAG solves both problems by separating knowledge storage from language generation. The LLM generates language; the vector database stores facts. On every query, the relevant facts are fetched and handed to the model as context.

How RAG Works: The 5-Stage Pipeline

| Stage | What Happens | Key Decisions |
| --- | --- | --- |
| 1. Ingestion | Load source documents (PDFs, web pages, databases, Notion, etc.) | Document loaders, preprocessing (clean HTML, extract tables) |
| 2. Chunking | Split documents into retrieval-sized units | Chunk size, overlap, splitting strategy (fixed vs. semantic) |
| 3. Embedding | Convert text chunks into numeric vectors using an embedding model | Embedding model choice (OpenAI, Cohere, open-source) |
| 4. Retrieval | At query time, embed the question, find nearest-neighbor chunks | Top-k, similarity threshold, hybrid search (vector + keyword) |
| 5. Generation | Pass retrieved chunks + question to LLM, generate grounded answer | Context window size, system prompt, citation format |
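
To make the five stages concrete, here is a dependency-free sketch of the whole loop. The bag-of-words "embedding" and in-memory list are toy stand-ins for a real embedding model and vector database, and the final LLM call is stubbed out as a prompt string:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector (real systems use a model)
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1-2. Ingest and chunk (here, each sentence is one chunk)
doc = "Parental leave is 16 weeks. Remote work is allowed. Offices close on holidays."
chunks = [s.strip() + "." for s in doc.split(".") if s.strip()]

# 3. Embed every chunk and store (chunk, vector) pairs -- the "vector DB"
index = [(chunk, embed(chunk)) for chunk in chunks]

# 4. Retrieve: embed the query and take the most similar chunk
query = "How long is parental leave?"
qvec = embed(query)
top = max(index, key=lambda pair: cosine(qvec, pair[1]))[0]

# 5. Generate: hand the retrieved chunk to the LLM as grounded context (stubbed)
prompt = f"Answer using only this context:\n{top}\n\nQuestion: {query}"
print(top)  # → Parental leave is 16 weeks.
```

Every production framework replaces each stage with a real component (PDF loaders, token-aware splitters, embedding APIs, an ANN index, an LLM call), but the data flow is exactly this.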

Chunking Strategy: The Highest-Impact Decision

Chunking is the step most developers get wrong. Poor chunking means relevant content gets split at bad boundaries, retrieval misses the right context, and your RAG system underperforms even with a great LLM.

| Strategy | How It Works | Best For | Chunk Size |
| --- | --- | --- | --- |
| Fixed-size | Split every N tokens with overlap | Unstructured prose, quick prototypes | 512 tokens, 50 overlap |
| Semantic (paragraph) | Split at paragraph or section breaks | Structured docs, articles, reports | Variable, ~200–800 tokens |
| Recursive | Split by hierarchy (section → paragraph → sentence) | Long documents with nested structure | LangChain default, adaptive |
| Agentic / Small-to-Big | Store small chunks for retrieval, return parent chunks for generation | High-precision retrieval + rich context | 128 retrieve / 1024 generate |
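
As an illustration of the fixed-size strategy, here is a minimal word-level splitter with overlap. It counts words rather than model tokens, so treat it as a sketch of the mechanics, not a drop-in replacement for a tokenizer-aware splitter:

```python
def chunk_fixed(words: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split a token list into fixed-size chunks. Each chunk repeats the
    last `overlap` tokens of the previous one, so a sentence cut at a
    boundary still appears intact in at least one chunk."""
    step = size - overlap
    return [words[i:i + size] for i in range(0, len(words), step) if words[i:i + size]]

tokens = [f"w{i}" for i in range(1200)]
chunks = chunk_fixed(tokens, size=512, overlap=50)
print(len(chunks))       # 3 chunks for 1200 tokens
print(chunks[1][0])      # second chunk starts at token 462, inside chunk 1's tail
```

The overlap is the important part: without it, a fact that straddles a boundary is split across two chunks and may never be retrieved whole.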

Vector Databases Compared

| Database | Type | Hybrid Search | Scale | Best For |
| --- | --- | --- | --- | --- |
| Pinecone | Managed SaaS | Yes | Billions of vectors | Production, managed, fast setup |
| Qdrant | Open source / managed | Yes | High | Self-hosted, advanced filtering |
| Weaviate | Open source / managed | Strong BM25 + vector | High | Teams needing robust hybrid search |
| Chroma | Open source | Basic | Small–medium | Local dev, prototyping, LangChain default |
| pgvector | PostgreSQL extension | Via PostgreSQL full-text | Medium | Teams already on Postgres, minimal extra infra |
| Milvus | Open source | Yes | Very high (billions) | Massive-scale self-hosted deployments |

Building a Basic RAG Pipeline (Python + Claude)

This example uses LangChain, Chroma, and Claude Sonnet 4.6 to build a simple document Q&A system:

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_anthropic import ChatAnthropic
from langchain_openai import OpenAIEmbeddings

# 1. Load document
loader = PyPDFLoader("company_handbook.pdf")
docs = loader.load()

# 2. Chunk with overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50
)
chunks = splitter.split_documents(docs)

# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Retrieve on query
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# 5. Generate with Claude
llm = ChatAnthropic(model="claude-sonnet-4-6")

def rag_query(question: str) -> str:
    # Retrieve relevant chunks
    relevant_docs = retriever.invoke(question)
    context = "\n\n".join([doc.page_content for doc in relevant_docs])

    # Generate grounded answer
    prompt = f"""Answer the question using ONLY the provided context.
    If the answer is not in the context, say "I don't have that information."

    Context:
    {context}

    Question: {question}"""

    response = llm.invoke(prompt)
    return response.content

# Test
answer = rag_query("What is our parental leave policy?")
print(answer)

Advanced RAG Techniques in 2026

Hybrid Search (Vector + BM25)

Combine semantic similarity search with keyword-based BM25 search. Hybrid retrieval consistently outperforms pure vector search — keyword search catches exact term matches that embeddings sometimes miss (model names, product codes, proper nouns).
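
A common way to merge the two result lists is reciprocal rank fusion (RRF), which needs only each document's rank in each list, not comparable scores. A minimal sketch (the `k = 60` constant is the conventional default):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists with reciprocal rank fusion:
    each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic-similarity order
bm25_hits   = ["doc_c", "doc_a", "doc_d"]   # exact-keyword order
print(rrf_fuse([vector_hits, bm25_hits]))   # → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that rank well in both lists (here `doc_a` and `doc_c`) rise to the top, while documents seen by only one retriever still survive into the fused list.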

Reranking

After retrieving top-k chunks, pass them through a cross-encoder reranker (Cohere Rerank, BGE Reranker) to reorder by true relevance. Adds 50–200ms latency but significantly improves answer quality for ambiguous queries.
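
The reranking step itself is just "score every (query, chunk) pair, then sort." Here is the shape of it, with a trivial word-overlap scorer standing in for a real cross-encoder model call:

```python
def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    def score(q: str, c: str) -> float:
        # Stand-in scorer: fraction of query words present in the chunk.
        # A real reranker replaces this with a cross-encoder that reads
        # the query and chunk together and outputs a relevance score.
        q_words = set(q.lower().split())
        return len(q_words & set(c.lower().split())) / len(q_words)
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_n]

chunks = [
    "Our office dog is named Biscuit.",
    "Parental leave lasts 16 weeks for all employees.",
    "Leave requests go through the HR portal.",
]
print(rerank("how many weeks of parental leave", chunks, top_n=2))
```

The key difference from first-stage retrieval is that a cross-encoder scores the query and chunk *jointly*, which is far more accurate than comparing two independently computed vectors, and correspondingly too slow to run over the whole corpus.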

Query Expansion / HyDE

Hypothetical Document Embeddings (HyDE): ask the LLM to generate a hypothetical answer to the query, then embed that to find matching documents. Effective when user queries are short and imprecise but documents are long and detailed.
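
In code, HyDE changes only *what gets embedded*: the hypothetical answer instead of the raw query. A sketch with the LLM, embedder, and vector search stubbed out as plain functions:

```python
def hyde_retrieve(query: str, llm, embed, search):
    """HyDE: embed a hypothetical *answer* rather than the short query,
    so the query vector lives in the same space as answer-like documents."""
    hypothetical = llm(f"Write a short passage that answers: {query}")
    return search(embed(hypothetical))

# Stubs so the sketch runs; swap in a real LLM, embedder, and vector store.
fake_llm = lambda prompt: "Parental leave lasts sixteen weeks and is fully paid."
fake_embed = lambda text: text.lower().split()
fake_search = lambda vec: [f"matched on {len(vec)} terms"]

print(hyde_retrieve("parental leave?", fake_llm, fake_embed, fake_search))
```

The extra LLM call adds latency and cost, so HyDE is usually reserved for queries that fail with direct embedding.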

Agentic RAG

Give the LLM the ability to decide whether to retrieve, what to retrieve, and when to retrieve again. Instead of a fixed single-retrieval pipeline, the agent can issue multiple search queries, synthesize across sources, and request clarification. Claude Code and LangGraph support agentic RAG workflows.

GraphRAG

Microsoft GraphRAG builds a knowledge graph from documents, enabling retrieval that understands entity relationships rather than just text similarity. Best for complex enterprise knowledge bases where documents reference shared entities (people, projects, policies).

RAG vs. Fine-Tuning vs. Long Context: When to Use Each

| Approach | Best When | Cost | Freshness | Citations |
| --- | --- | --- | --- | --- |
| RAG | Large corpus, changing data, citations needed | Medium (infra) | Real-time | Yes — source docs |
| Fine-tuning | Specific tone/format, static knowledge, latency-critical | High (training) | Training cutoff | No — baked in |
| Long context | Small corpus (<500K tokens), no latency budget | High (tokens) | Manual update | Partial |
| RAG + fine-tune | Production systems needing both quality and freshness | Highest | Real-time | Yes |

Common RAG Failure Modes and Fixes

| Failure Mode | Symptom | Fix |
| --- | --- | --- |
| Wrong chunks retrieved | Answers are plausible but miss the actual relevant content | Improve chunking, add hybrid search, use reranking |
| Context window overflow | Retrieved chunks don't fit, LLM truncates | Reduce chunk size, reduce top-k, use map-reduce summarization |
| Model ignores retrieved context | LLM answers from training memory, ignores provided docs | Stronger system prompt ("ONLY use the provided context"), explicitly instruct to cite |
| Stale embeddings | Updated documents not reflected in retrieval | Implement document versioning, re-embed on update, use timestamps as metadata filter |
| Semantic search misses exact terms | Fails on product codes, model numbers, names | Add BM25 hybrid search (vector handles semantics, BM25 handles exact terms) |
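
For the stale-embeddings case, one common pattern is to store a version number (or timestamp) in each chunk's metadata and filter at query time so only the latest re-embedded chunks are eligible. A minimal sketch of that filter, with the metadata keys (`doc_id`, `version`, `text`) chosen for illustration:

```python
def latest_chunks(chunks: list[dict]) -> list[dict]:
    """Keep only chunks whose version matches the newest version seen for
    their source document; chunks from older embeddings are skipped."""
    newest: dict[str, int] = {}
    for c in chunks:
        newest[c["doc_id"]] = max(newest.get(c["doc_id"], 0), c["version"])
    return [c for c in chunks if c["version"] == newest[c["doc_id"]]]

index = [
    {"doc_id": "handbook", "version": 1, "text": "Leave is 12 weeks."},
    {"doc_id": "handbook", "version": 2, "text": "Leave is 16 weeks."},
    {"doc_id": "faq",      "version": 1, "text": "HR portal link."},
]
print([c["text"] for c in latest_chunks(index)])
# → ['Leave is 16 weeks.', 'HR portal link.']
```

Most vector databases can apply this kind of metadata filter inside the query itself, which avoids retrieving stale chunks in the first place.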

Build RAG Applications with HappyCapy

HappyCapy gives you access to Claude Sonnet 4.6 for building document Q&A systems, research tools, and knowledge base chatbots. Start with a free trial.

Try HappyCapy Free

Frequently Asked Questions

What is RAG (Retrieval-Augmented Generation)?

RAG combines an LLM with an external knowledge retrieval system. It first searches a vector database for relevant documents, then passes those as context to the LLM to generate grounded answers with citations. It prevents hallucination by giving the model facts to reference rather than relying on parametric memory.

When should I use RAG vs. fine-tuning?

Use RAG when your knowledge changes frequently, you need source citations, or your corpus is too large for a context window. Use fine-tuning for static knowledge, specific tone/format, or latency-critical applications. Most production systems use both: RAG for knowledge, fine-tuning for style.

What vector database should I use for RAG?

Pinecone for managed production deployments. Qdrant for self-hosted with advanced filtering. Chroma for local development. pgvector if you're already on PostgreSQL. Weaviate for strong hybrid search. The best choice depends on your existing infrastructure and scale requirements.

What chunk size should I use for RAG?

Start with 512 tokens with 50-token overlap. Use smaller chunks (128–256 tokens) for precise factual retrieval. Use larger chunks (1024+ tokens) when context continuity matters. Always test empirically — chunk size has the highest impact on RAG quality of any single parameter.

Sources: Anthropic documentation, LangChain RAG documentation, Microsoft GraphRAG paper, Pinecone RAG guide, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020).
