Retrieval-Augmented Generation (RAG)

Learn how RAG lets an LLM answer questions using relevant external documents fetched at query time.

Difficulty: intermediate
Read time: 8 min
Tags: rag, retrieval, llm, embeddings, vector-search, vector-database, knowledge-grounding, nlp
Updated: February 8, 2026

What Is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is a way to make a language model feel like it “looked something up” before answering.

Analogy: imagine an extremely smart colleague with a decent memory… but who also has access to a well-organized library. When you ask a question, they first walk to the right shelf, grab a few relevant pages, skim them, and then answer—using those pages as evidence. That “walk to the shelf” step is retrieval; the final response is generation.

Technically, RAG is a pattern where an LLM (the generator) is given retrieved context (snippets from documents, databases, or knowledge bases) at inference time, so the answer can be based on your data instead of only what the model memorized during training.

Why Does It Matter?

LLMs are impressive, but they have two chronic limitations:

  1. Their built-in knowledge is frozen at training time.
  2. They can’t read your entire data source (wikis, PDFs, policies, tickets, codebases) in one go because of context window limits.

RAG matters because it helps you build systems that:

  • Answer questions using current and private information (company docs, product manuals, internal policies).
  • Provide traceability: you can show the user what text the model used, which is a big deal for trust and auditing.
  • Update knowledge without retraining the model—just update the document store and re-index.

In practice, RAG is the backbone of many “chat with your data” applications and internal copilots.

How It Works

A clean mental model: Index first, then retrieve, then generate.
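As a rough skeleton in code (the function names are illustrative only; the numbered steps below fill in each body):

# Illustrative three-phase skeleton; the steps below fill in each phase.
def build_index(documents):
    # Steps 1-3: clean, chunk, embed, and store the source material.
    ...

def retrieve(question, index, k=5):
    # Step 4: embed the question and return the top-k most similar chunks.
    ...

def generate(question, retrieved_chunks):
    # Step 5: have the LLM answer, grounded in the retrieved chunks.
    ...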

1) Prepare the knowledge (ingestion)

You start with your source material: PDFs, docs, web pages, tickets, code, etc.

Typical steps:

  • Clean text (remove boilerplate, fix encoding).
  • Chunk into pieces (e.g., 200–800 tokens). Chunking matters because retrieval works better on smaller, focused passages, and the model's context window is finite anyway (a minimal chunker sketch follows the example below).

Example: A 40-page HR policy becomes ~200 chunks like “Vacation policy — carryover rules”, “Sick leave — documentation”, etc.
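A minimal chunking sketch, splitting on a plain word-count window. Real pipelines usually count tokens and respect sentence or heading boundaries; max_words and overlap here are arbitrary illustrative values.

def chunk_text(text, max_words=150, overlap=30):
    """Split text into overlapping word-window chunks (a toy stand-in for token-based chunking)."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
    return chunks

policy_text = "Refunds are available within 30 days for unopened items. " * 50
print(len(chunk_text(policy_text)))  # number of chunks produced from this toy document

The overlap keeps a sentence that straddles a boundary visible in two neighboring chunks, so it can still be retrieved whole.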

2) Turn chunks into vectors (embeddings)

Each chunk is converted into an embedding: a list of numbers that represents meaning (roughly: “what this text is about”).

Intuition: embeddings place text into a “meaning-space” so that similar ideas land near each other, even if the wording differs.
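A minimal sketch of this step, assuming the sentence-transformers package and its all-MiniLM-L6-v2 model; any embedding model or embeddings API plays the same role, and the chunk texts below are invented to echo the HR example above.

from sentence_transformers import SentenceTransformer  # assumed dependency

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

chunks = [
    "Vacation policy: unused days carry over up to five days per year.",
    "Sick leave requires documentation after three consecutive days.",
]
embeddings = model.encode(chunks)  # one 384-dimensional vector per chunk for this model
print(embeddings.shape)            # (2, 384)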

3) Store them in an index (vector database / vector index)

You store embeddings (and their original text) in a vector index. This index supports fast “find the most similar vectors” queries.
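In production this is usually a vector database or a library such as FAISS. The sketch below uses a brute-force NumPy index only to show what the index fundamentally does: store vectors alongside their text and return the nearest neighbors.

import numpy as np

class TinyVectorIndex:
    """Brute-force in-memory index: stores (vector, text) pairs and returns the most similar texts."""

    def __init__(self):
        self.vectors = []   # list of 1-D numpy arrays
        self.texts = []     # original chunk text, kept alongside each vector

    def add(self, embedding, text):
        self.vectors.append(np.asarray(embedding, dtype=float))
        self.texts.append(text)

    def search(self, query_embedding, k=5):
        q = np.asarray(query_embedding, dtype=float)
        mat = np.stack(self.vectors)  # (n_chunks, dim)
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-10)  # cosine similarity
        top = np.argsort(-sims)[:k]   # indices of the k highest similarities
        return [(self.texts[i], float(sims[i])) for i in top]

Real vector databases replace the brute-force scan with approximate nearest-neighbor structures (e.g., HNSW) so search stays fast at millions of vectors.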

4) At question time, retrieve relevant chunks

When a user asks a question:

  1. Embed the question.
  2. Search the vector index for the top-k most similar chunks (e.g., k=5).
  3. Optionally re-rank results with a stronger model (common in higher-quality RAG systems).

A common similarity measure is cosine similarity:

\text{sim}(q, d) = \frac{q \cdot d}{\|q\| \, \|d\|}

Intuition before math: treat embeddings like arrows in space; cosine similarity measures how closely those arrows point in the same direction. The closer the directions, the more semantically related the texts likely are.
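A tiny numeric example of the formula, with made-up three-dimensional embeddings (real ones have hundreds of dimensions):

import numpy as np

q = np.array([0.9, 0.1, 0.0])   # toy query embedding
d1 = np.array([0.8, 0.2, 0.1])  # chunk about the same topic
d2 = np.array([0.0, 0.1, 0.9])  # chunk about something else

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(q, d1))  # ≈ 0.98: the arrows point almost the same way
print(cosine(q, d2))  # ≈ 0.01: nearly unrelated directions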

5) Assemble a prompt and generate the answer

You create a prompt like:

  • System instructions (“Answer using the provided context. If missing, say you don’t know.”)
  • The user question
  • The retrieved chunks (often called “context”)

Then the LLM generates an answer conditioned on those chunks. This is the “augmented” part: generation is now anchored in retrieved evidence.
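A sketch of the assembly step. The retrieved chunks are taken from the walkthrough below, and llm_complete is a hypothetical placeholder for whichever chat/completions API you actually call.

SYSTEM_PROMPT = (
    "Answer using only the provided context. "
    "If the context does not contain the answer, say you don't know."
)

def build_prompt(question, retrieved_chunks):
    """Join retrieved chunks into a numbered context block and attach the question."""
    context = "\n\n".join(f"[{i+1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"

retrieved = [
    "Refunds are available within 30 days for unopened items.",
    "Opened items are eligible only for store credit.",
]
prompt = build_prompt("Can I get my money back if I opened it?", retrieved)
# answer = llm_complete(prompt)  # hypothetical call to your LLM of choice
print(prompt)

Numbering the chunks ([1], [2]) also makes it easy to ask the model to cite which passage supported its answer, which feeds the traceability point above.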

A tiny concrete walkthrough

Suppose your internal doc says:

“Refunds are available within 30 days for unopened items. Opened items are eligible only for store credit.”

User asks: “Can I get my money back if I opened it?”

Retrieval finds the refund chunk. The model answers:

“Opened items aren’t eligible for a cash refund; they’re eligible for store credit.”

Without RAG, the model might guess based on generic e-commerce norms. With RAG, it can align to your policy.

Key Terminology

  • Embedding: A numeric representation of text (or images) capturing semantic meaning, used to compare similarity.
  • Chunking: Splitting documents into smaller passages to improve retrieval quality and fit context limits.
  • Retriever: The component that selects relevant chunks for a query (often via vector similarity search).
  • Vector index / vector database: A data structure/system optimized to store embeddings and quickly return nearest neighbors.
  • Grounding (and provenance): Using retrieved sources to anchor outputs; provenance means you can point to the supporting passages.

Real-World Applications

  • Internal knowledge assistants: “How do we file expenses?” answered from your finance policy docs.
  • Customer support copilots: Draft replies using product manuals and prior resolved tickets.
  • Developer tools: “Where is this API defined?” answered by retrieving relevant code chunks.
  • Search + synthesis: Retrieval finds the best passages across many documents; generation turns them into a concise explanation.
  • Multimodal RAG: Retrieving not only text but also diagrams/tables/images (or their extracted representations) so answers can reflect visual documents too.

Common Misconceptions

  1. “RAG eliminates hallucinations.” It reduces them when retrieval is good, but it doesn’t magically guarantee truth. If retrieval fetches irrelevant chunks (or misses the right ones), the model may still improvise. Guardrails (like “say you don’t know”) and evaluation matter.

  2. “RAG is just keyword search + LLM.” Keyword search helps, but modern RAG typically relies on semantic retrieval (embeddings), which can match meaning even when terms differ. Many production systems use hybrids (keyword + vector) for robustness.

  3. “More context is always better.” Dumping 50 chunks into the prompt often worsens quality. You want the smallest set of the most relevant passages. Too much context dilutes signal and can confuse the model.

Further Reading

  • “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Lewis et al., 2020)
  • OpenAI Cookbook: “Retrieval augmented generation using Elasticsearch”
  • LangChain docs: “RAG” / “Retrieval”
  • LlamaIndex docs: “Understanding RAG”