Chunking & Indexing Strategies for RAG

Learn how to split documents into retrievable chunks, attach the right metadata, and index content so RAG retrieves the right context reliably.

Difficulty: Intermediate
Read time: 11 min
Tags: rag, chunking, indexing, chunk-overlap, metadata, hierarchical-chunking, retrieval, vector-search, reranking
Updated: February 11, 2026

What Are Chunking & Indexing Strategies for RAG?

In Retrieval-Augmented Generation (RAG), the model answers using snippets retrieved from your data. The catch is that your data isn’t naturally shaped into perfect, bite-sized facts: it’s messy PDFs, docs, wikis, tickets, and pages full of headings, lists, and tangents.

Chunking is how you cut large documents into smaller pieces that can be embedded and retrieved effectively. Indexing strategies are how you store those chunks (plus metadata) so retrieval is fast, filterable, and likely to return the right context.

Analogy: imagine you’re making a “recipe box” from a huge cookbook collection. If you tear out whole chapters, it’s hard to find the exact step you need. If you cut every sentence into separate scraps, you lose the “why” and “how” around it. Good chunking is cutting the pages into cards that keep a complete idea together. Good indexing is how you label and sort those cards so you can find them instantly.

Technically, chunking is a preprocessing step that produces retrieval units (often 200–1,000 tokens each) with optional overlap, and indexing is storing each unit with an embedding and metadata in a retrieval system (vector index, hybrid search, or multi-level index).

Why Does It Matter?

Because most “RAG failures” aren’t about the LLM being dumb—they’re about retrieval feeding it the wrong stuff.

Chunking and indexing directly impact:

  • Answer quality (grounding): If you retrieve irrelevant or incomplete context, the model will confidently improvise.
  • Recall vs precision: Big chunks can contain the answer but also lots of noise. Small chunks are precise but may miss essential context.
  • Latency and cost: More chunks and larger chunk sizes can mean more embeddings to compute, more storage, and more tokens to send to the LLM.
  • Maintainability: Good metadata enables access control, freshness filters, and debugging (“where did this answer come from?”).

If you want RAG that feels reliable—like “it actually found the right paragraph”—this is where that reliability is built.

How It Works

A practical mental model: Split → Represent → Store → Retrieve → Assemble.
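In code terms, the whole loop has roughly this shape. Every name here (split, embed, store, llm) is a placeholder for the concrete choices covered in the steps below, not a specific library:

```python
def build_index(documents, split, embed, store):
    for doc in documents:
        for i, chunk in enumerate(split(doc["text"])):        # Split
            vector = embed(chunk)                              # Represent
            store.add(text=chunk, embedding=vector,            # Store
                      metadata={"doc_id": doc["id"], "chunk_index": i})

def answer(question, embed, store, llm, top_k=5):
    hits = store.search(embed(question), top_k=top_k)          # Retrieve
    context = "\n\n".join(h["text"] for h in hits)             # Assemble
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```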

1) Split documents into chunks (choose a strategy)

Chunking has two jobs:

  1. keep each chunk semantically coherent (a complete thought),
  2. keep each chunk small enough to retrieve and fit into the model context.

Common chunking approaches:

A) Fixed-size chunking (token/character windows)

  • Split by length (e.g., 500 tokens) with overlap.
  • Simple and robust, but can split in the middle of a section.

B) Structure-aware chunking (preferred for docs/markdown)

  • Split by headings, paragraphs, list boundaries, code fences.
  • Produces chunks that match how humans organized the information.

C) Semantic chunking (content-aware)

  • Split when topic shifts (sometimes using embeddings or heuristics).
  • More compute, often better coherence.

A helpful rule: start structure-aware when possible, fall back to fixed-size when structure is unreliable.
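As a concrete sketch of that rule, here is structure-aware splitting on markdown headings with a fixed-size, overlapping fallback for oversized sections. The 400-word size and 50-word overlap are illustrative, and word counts stand in for token counts to keep the example dependency-free:

```python
import re

def chunk_document(text: str, max_words: int = 400, overlap: int = 50) -> list[str]:
    """Structure-aware first: split on markdown headings; fall back to fixed-size
    word windows (with overlap) for sections that run too long."""
    sections = re.split(r"(?m)^(?=#{1,6} )", text)  # split before each heading line
    chunks = []
    for section in sections:
        words = section.split()
        if not words:
            continue
        if len(words) <= max_words:
            chunks.append(section.strip())
        else:
            step = max_words - overlap              # fixed-size fallback with overlap
            for start in range(0, len(words), step):
                chunks.append(" ".join(words[start:start + max_words]))
                if start + max_words >= len(words):
                    break
    return chunks
```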

2) Choose chunk size (and measure it in tokens if you can)

Chunk size is usually discussed in tokens, because:

  • embeddings are computed over tokens,
  • LLM context windows are token-based,
  • overlap should also be token-based.

General intuition:

  • Smaller chunks → more precise retrieval, less noise, but risk missing context.
  • Larger chunks → more context per hit, but embeddings get “averaged” over multiple ideas, and retrieval can become fuzzy.

Practical starting points (not commandments):

  • FAQ / knowledge base / policies: ~300–800 tokens
  • Technical docs with dense info: ~500–1,000 tokens
  • Short tickets/messages: chunk by message or paragraph (often no need for large windows)

Then evaluate. Chunk size is a knob, not a religion.
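If you want to measure sizes in tokens rather than words, a tokenizer library such as tiktoken makes the check trivial (assuming an OpenAI-style encoding; use whatever tokenizer matches your embedding model):

```python
import tiktoken

# cl100k_base matches several OpenAI embedding/chat models; swap in the
# encoding your embedding model actually uses.
enc = tiktoken.get_encoding("cl100k_base")

def token_len(text: str) -> int:
    return len(enc.encode(text))

# Use this to verify chunks stay inside your target range.
print(token_len("Employees accrue 1.5 sick days per month, capped at 18 per year."))
```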

3) Add overlap (a seatbelt for boundary cuts)

Overlap means consecutive chunks share some tokens.

Why it helps:

  • If a key sentence is split across a boundary, overlap ensures one chunk still contains the full idea.
  • It reduces “fragmentation” where the answer is half in chunk A and half in chunk B.

Typical starting overlap is often described as 10–20% of chunk size (e.g., 50–100 tokens overlap for 500-token chunks). Too much overlap creates near-duplicate chunks, which can waste retrieval slots and inflate index size.

4) Attach metadata (so retrieval can behave like a product, not a demo)

Metadata is what makes your vector store usable in the real world.

Useful fields to store per chunk:

  • doc_id (stable identifier for the source document)
  • source_url or path
  • title
  • section_path (e.g., ["HR Handbook", "Leave", "Sick Leave"])
  • created_at, updated_at
  • tags (domain labels)
  • tenant_id / team_id (multi-tenant separation)
  • acl / visibility rules (who can see it)
  • chunk_index and char_range or token_range (for debugging and highlighting)

This is how you enable filtering like:

  • “only show docs from Team X”
  • “only updated after last quarter”
  • “only public content”
  • “only section = Security Policies”
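A minimal sketch of a chunk record carrying this kind of metadata, plus the sort of filter a store would apply at query time. Field names follow the list above and are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    doc_id: str
    chunk_index: int
    text: str
    embedding: list[float]
    section_path: list[str] = field(default_factory=list)
    updated_at: str = ""                 # ISO date, e.g. "2026-01-15"
    tags: list[str] = field(default_factory=list)
    tenant_id: str = ""
    visibility: str = "private"          # e.g. "public" | "private"

def passes_filters(chunk: ChunkRecord, tenant_id: str, updated_after: str) -> bool:
    """The kind of metadata filter applied before (or alongside) vector similarity."""
    return (
        chunk.tenant_id == tenant_id
        and chunk.updated_at >= updated_after     # ISO date strings compare lexically
        and chunk.visibility == "public"
    )
```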

5) Indexing strategies (how you store chunks for better retrieval)

Once you have chunks + metadata, you still need to decide how to index them.

Strategy A: Flat chunk index (baseline)

  • Store each chunk as one embedding.
  • Retrieve top-k chunks by similarity.
  • Works well as a first implementation.

Strategy B: Parent–child indexing (retrieve small, return big)

  • Embed smaller “child” chunks for precise matching,
  • but return a larger “parent” window (e.g., the whole section) to the LLM.
  • This reduces “missing context” while keeping retrieval accurate.
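A sketch of the idea, assuming simple in-memory dictionaries for the child index and the child-to-parent mapping:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_parent_context(query_vec, child_index, parent_of, parents, top_k=4):
    """child_index: child_id -> embedding; parent_of: child_id -> parent_id;
    parents: parent_id -> full section text."""
    # 1) Match against the small, precise child chunks.
    best = sorted(child_index.items(),
                  key=lambda kv: cosine_similarity(query_vec, kv[1]),
                  reverse=True)[:top_k]
    # 2) Hand the larger parent sections (de-duplicated) to the LLM.
    parent_ids = []
    for child_id, _ in best:
        pid = parent_of[child_id]
        if pid not in parent_ids:
            parent_ids.append(pid)
    return [parents[pid] for pid in parent_ids]
```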

Strategy C: Hierarchical indexing (multi-level retrieval)

  • Index multiple granularities:

    • sentence/paragraph level (fine)
    • section level (medium)
    • document summary level (coarse)
  • Retrieval can pick the appropriate level based on the question.

A research-flavored variant is tree-based retrieval with recursive summaries (e.g., retrieving from multiple levels of abstraction). This helps questions that require “big picture” context instead of one local paragraph.

Strategy D: Multi-vector per document/section

  • Store multiple embeddings for the same content:

    • one for the raw text,
    • one for a generated summary,
    • one for extracted keywords/entities.
  • Queries can match against whichever representation is most aligned.
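A sketch of multi-vector indexing, where summarize(), extract_keywords(), and embed() are assumed stand-ins for your own LLM/NLP calls; the point is that several vectors all resolve to one chunk_id:

```python
def index_multi_vector(store: list, chunk_id: str, text: str,
                       embed, summarize, extract_keywords) -> None:
    representations = {
        "raw": text,
        "summary": summarize(text),
        "keywords": ", ".join(extract_keywords(text)),
    }
    for kind, rep in representations.items():
        store.append({
            "chunk_id": chunk_id,          # every vector points back to the same content
            "representation": kind,
            "embedding": embed(rep),
        })
```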

6) Test retrieval like you test code

A simple evaluation loop:

  1. Collect ~30–100 real questions you care about.

  2. For each question, inspect retrieved chunks:

    • Are they relevant?
    • Do they contain the answer?
    • Are they missing necessary surrounding context?
  3. Adjust one knob at a time:

    • chunk size, overlap, splitting method, metadata filters, top-k
  4. Repeat.
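One lightweight way to score each iteration is a hit-rate check over that question set. Here retrieve() and the expected_doc_id labels are assumptions about your own setup:

```python
def retrieval_hit_rate(questions, retrieve, top_k=5):
    """questions: list of {"query": ..., "expected_doc_id": ...};
    retrieve(query, top_k) should return chunks that carry a doc_id."""
    hits = 0
    for q in questions:
        results = retrieve(q["query"], top_k=top_k)
        if any(r["doc_id"] == q["expected_doc_id"] for r in results):
            hits += 1
    return hits / len(questions)

# Re-run after each single-knob change (chunk size, overlap, splitter, filters, top-k)
# and compare scores before touching the generator.
```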

Most teams get a big quality jump just by iterating on chunking + metadata + retrieval parameters before touching the generator.

Key Terminology

  • Chunk: A retrievable unit of text (often a paragraph/section window) that gets embedded and stored.
  • Chunk overlap: Shared content between adjacent chunks to prevent boundary cuts from losing meaning.
  • Metadata filtering: Restricting retrieval by attributes like tenant, doc type, freshness, or permissions.
  • Parent–child retrieval: Using small chunks for matching but returning larger context windows for generation.
  • Hierarchical retrieval: Retrieving from multiple levels (chunk/section/doc summary) to balance precision and context.

Real-World Applications

  • Internal policy assistant: Chunk by headings (e.g., “Leave → Sick Leave”), store section_path, filter by department, return parent sections for clarity.
  • Engineering docs RAG: Split by markdown headers and code blocks, store file paths + repo version, retrieve top-k chunks then rerank.
  • Customer support copilot: Chunk per ticket message or resolution section, attach product/version metadata, retrieve similar issues and known fixes.
  • Compliance/Legal search: Use strict metadata + citations, chunk around clauses, and return parent context to avoid misreading a single sentence.
  • Knowledge hub websites: Use structure-aware chunking for pages, add tags/categories, and power “related articles” using embeddings + filters.

Common Misconceptions

  1. “There is a perfect chunk size.” Chunking is data-dependent. Policies, code, and chat logs behave differently. Start with defaults, then evaluate with real queries.

  2. “More overlap always improves quality.” Too much overlap creates duplicates that crowd out diverse results. Overlap is a safety net, not a blanket.

  3. “Indexing is just storing vectors.” Without metadata and a retrieval strategy (parent–child, hierarchical, hybrid), you’ll hit relevance, permissions, and debugging pain fast.

Further Reading

  • Pinecone Learn: Chunking Strategies for LLM Applications (practical overview and trade-offs).
  • Stack Overflow Engineering Blog: Breaking up is hard to do: Chunking in RAG applications (real-world framing and adaptive approaches).
  • RAPTOR (ICLR 2024): tree-organized retrieval via recursive chunking/summaries for multi-level context.