Chunking Is Still the #1 Bottleneck in RAG
Despite advances in models and embeddings, chunking remains the weakest link in most RAG systems. This article explains why chunking dominates retrieval quality and how poor chunk design quietly undermines production reliability.
Why retrieval quality is decided before the model ever runs
TL;DR
In most RAG systems, failures are blamed on models, prompts, or embeddings. In reality, chunking decisions determine retrieval quality long before generation begins. Better models and larger context windows cannot compensate for poorly designed chunks. Chunking remains the single most important—and most underestimated—bottleneck in production RAG.
Why chunking is easy to underestimate
Chunking is often framed as a preprocessing task:
- Split documents
- Add overlap
- Generate embeddings
- Move on
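The "split, add overlap" step above can be sketched in a few lines. This is a minimal illustration of the naive approach, not a recommendation: sizes are counted in whitespace-separated words for simplicity (real systems count tokens), and the function name `chunk_fixed` is ours.

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Naive fixed-size chunking with overlap. Sizes are in words, not tokens."""
    assert 0 <= overlap < size, "overlap must be smaller than chunk size"
    words = text.split()
    step = size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # last window already covered the tail
            break
    return chunks
```

Note what this function never looks at: headings, sentences, or paragraphs. Every boundary it produces is arbitrary, which is exactly the failure mode discussed below.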
Because chunking happens “offline,” it feels like a solved problem.
In practice, chunking defines the information units your system can ever retrieve. If those units are wrong, retrieval can never be right.
Retrieval cannot surface what chunking destroys
Retrieval works by selecting among existing chunks. It cannot:
- Reconstruct missing context
- Merge fragmented meaning
- Infer relationships split across chunks
When chunking breaks semantic boundaries, retrieval becomes a best-effort guess among bad options.
At that point, improving embeddings or models only improves how confidently the wrong chunk is selected.
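The point can be made concrete: a retriever is essentially an argmax over chunk vectors that already exist. In this toy sketch (hand-made vectors stand in for embeddings; names are illustrative, and vectors are assumed nonzero), the best the system can do is pick the least-bad existing chunk — it cannot synthesize a missing one.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec: list[float], chunk_vecs: list[list[float]]) -> int:
    """Return the index of the most similar chunk. Selection only ever
    happens among the chunks that chunking produced."""
    return max(range(len(chunk_vecs)), key=lambda i: cosine(query_vec, chunk_vecs[i]))
```

If the right chunk was never created, a better `cosine` function just ranks the wrong chunks more confidently.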
The three common chunking failures
1. Chunks that are too small
Small chunks can improve retrieval recall, but they often strip away the surrounding context that gives a passage its meaning.
Symptoms:
- Answers lack necessary context
- Retrieved text feels incomplete
- The model fills gaps with assumptions
2. Chunks that are too large
Large chunks preserve context but reduce selectivity.
Symptoms:
- Retrieval pulls in irrelevant information
- Signal-to-noise ratio collapses
- Context windows fill quickly
3. Arbitrary boundaries
Chunking based on character count or tokens alone ignores document structure.
Symptoms:
- Headings separated from content
- Lists split mid-thought
- Logical sections broken apart
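A small step toward structure-aware boundaries is to split on the document's own headings, so a heading never gets separated from its content. A minimal sketch for markdown input (the function name and regex are illustrative; production splitters also handle lists, tables, and maximum-size limits):

```python
import re

def chunk_by_heading(markdown: str) -> list[str]:
    """Split markdown before each heading line, keeping every heading
    attached to the section body that follows it."""
    # Zero-width lookahead: split *before* lines starting with 1-6 '#' marks,
    # so the heading text stays inside its own chunk.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [s.strip() for s in sections if s.strip()]
```

Compare this with character-count splitting: the boundaries now coincide with the author's own logical sections instead of landing mid-thought.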
These failures compound silently.
Why better embeddings don’t fix bad chunking
Embeddings measure similarity between chunks and queries. They do not fix:
- Missing information
- Poor boundaries
- Overloaded chunks
If chunking is wrong, embeddings faithfully retrieve the wrong thing faster.
This is why teams often observe diminishing returns from:
- New embedding models
- Higher-dimensional vectors
- More expensive similarity search
The bottleneck remains upstream.
Chunking errors amplify downstream costs
Poor chunking has cascading effects:
- Higher cost: more chunks are retrieved to compensate for missing context.
- Higher latency: larger contexts and more tokens slow generation.
- Lower reliability: answers vary depending on which partial chunks happen to surface.
By the time these symptoms appear, the root cause is far removed from the generation layer.
Why larger context windows make chunking harder
Large context windows create a false sense of safety:
“We can just include more chunks.”
This approach:
- Masks poor chunk boundaries
- Dilutes attention
- Makes failures harder to debug
Chunking quality matters more, not less, as context windows grow.
What production teams do differently
Teams with reliable RAG systems treat chunking as:
- An information architecture problem
- A domain-specific design choice
- A continuously evaluated component
They:
- Align chunks with semantic units
- Preserve structural metadata
- Measure retrieval effectiveness per chunk strategy
- Iterate on chunking independently of models
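Measuring retrieval effectiveness per chunk strategy can start as simply as a hit rate over a labeled query set. A hedged sketch, assuming you already have ranked chunk ids per query and a set of known-relevant ids for each (all names here are hypothetical):

```python
def hit_rate_at_k(
    results: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose top-k retrieved chunk ids contain at least
    one known-relevant id. `results` maps query -> ranked chunk ids produced
    under one chunking strategy."""
    hits = sum(
        1
        for query, ranked in results.items()
        if set(ranked[:k]) & relevant.get(query, set())
    )
    return hits / max(len(results), 1)
```

Running the same labeled queries against indexes built with different chunking strategies turns "which chunking is better?" from a debate into a number you can track per release.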
Chunking becomes an explicit design surface, not a one-time step.
Related Skills (Recommended Reading)
To understand and address this bottleneck:
- Chunking Strategies That Actually Work
- Retrieval Is the Hard Part
- Evaluating RAG Quality
- Why RAG Exists (And When Not to Use It)
These skills explain how chunking choices propagate through retrieval, context assembly, and evaluation.
Closing thought
RAG systems do not fail at generation. They fail at information selection.
Chunking defines the universe of information your system can reason over. Until chunking is treated as a first-class concern, RAG reliability will remain elusive—regardless of how advanced the model becomes.