Chunking Strategies That Actually Work
Effective chunking is an information architecture problem, not a text-splitting task. This article covers practical chunking strategies that improve retrieval accuracy in real-world RAG systems.
Why Chunking Determines RAG Success
Most RAG failures happen before the LLM is even called:
- Query returns irrelevant documents
- Relevant information is split across chunks
- Retrieved chunks lack necessary context
The root cause is usually poor chunking strategy.
Chunking is not “split text into N-character pieces.” It is information architecture: how do you structure knowledge so relevant information can be found and used?
This article covers:
- Why naive chunking fails
- Strategies that work in production
- Trade-offs between approaches
- How to evaluate chunking quality
The Problem with Naive Chunking
Approach 1: Fixed Character Count
# Simple but problematic
def chunk_by_chars(text: str, size: int = 500) -> list[str]:
    chunks = []
    for i in range(0, len(text), size):
        chunks.append(text[i:i + size])
    return chunks

# Problems:
doc = "The API rate limit is 1000 requests per hour. To increase your limit, con"
# Chunk breaks mid-sentence
# Key information split across chunks
Why this fails:
- Breaks semantic units arbitrarily
- Context lost at boundaries
- No concept of document structure
Approach 2: Fixed Token Count
def chunk_by_tokens(text: str, max_tokens: int = 128) -> list[str]:
    tokens = tokenize(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(detokenize(chunk_tokens))
    return chunks

# Better than characters, but still arbitrary
Why this is better but not enough:
- Respects token boundaries (important for embeddings)
- Still breaks semantic units
- Ignores document structure
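The sketch above leaves `tokenize` and `detokenize` undefined. As a minimal, self-contained illustration, here is the same strategy with whitespace tokens standing in for a real tokenizer (a production system would use the embedding model's own tokenizer, e.g. via `tiktoken`):

```python
def chunk_by_tokens_ws(text: str, max_tokens: int = 4) -> list[str]:
    # Whitespace tokens stand in for real tokenizer output.
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

chunks = chunk_by_tokens_ws("one two three four five six seven eight nine",
                            max_tokens=4)
# 9 tokens with max_tokens=4 → chunks of 4, 4, and 1 tokens
```

Note how the final chunk is a single dangling token: the strategy respects the token budget but nothing else, which is exactly the "still arbitrary" problem described above.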
Approach 3: Sentence-Based
def chunk_by_sentences(text: str, sentences_per_chunk: int = 5):
    sentences = split_into_sentences(text)
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[i:i + sentences_per_chunk])
        chunks.append(chunk)
    return chunks
Progress, but limitations:
- Preserves sentence integrity
- But loses section/paragraph context
- May group unrelated sentences
Strategy 1: Semantic Chunking
Principle
Chunk by meaning, not by length.
Each chunk should represent a coherent semantic unit:
- A complete thought
- A self-contained concept
- An answerable piece of information
Implementation: Paragraph-Based
def chunk_by_paragraphs(text: str, max_tokens: int = 512) -> list[str]:
    """
    Chunk by paragraphs, respecting semantic boundaries.
    Combine small paragraphs, split large ones.
    """
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = []
    current_tokens = 0
    for para in paragraphs:
        para_tokens = count_tokens(para)
        # Single paragraph too large → split at sentence boundaries
        if para_tokens > max_tokens:
            if current_chunk:
                chunks.append('\n\n'.join(current_chunk))
                current_chunk = []
                current_tokens = 0
            # Split large paragraph
            sentences = split_into_sentences(para)
            for sent in sentences:
                sent_tokens = count_tokens(sent)
                if current_tokens + sent_tokens > max_tokens:
                    if current_chunk:
                        chunks.append(' '.join(current_chunk))
                    current_chunk = [sent]
                    current_tokens = sent_tokens
                else:
                    current_chunk.append(sent)
                    current_tokens += sent_tokens
            # Flush sentence-level pieces before returning to paragraph mode,
            # so sentence and paragraph units are never mixed in one chunk
            if current_chunk:
                chunks.append(' '.join(current_chunk))
                current_chunk = []
                current_tokens = 0
        else:
            # Add paragraph to current chunk if it fits
            if current_tokens + para_tokens > max_tokens:
                chunks.append('\n\n'.join(current_chunk))
                current_chunk = [para]
                current_tokens = para_tokens
            else:
                current_chunk.append(para)
                current_tokens += para_tokens
    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))
    return chunks
When to use:
- Prose documents (articles, documentation, books)
- Content written in natural paragraphs
- When semantic coherence matters
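To make the packing half of this strategy concrete, here is a minimal, self-contained sketch that greedily packs paragraphs under a token budget. Whitespace tokens stand in for `count_tokens`, and unlike the full version above, this sketch does not split oversized paragraphs:

```python
def pack_paragraphs(paragraphs: list[str], max_tokens: int = 20) -> list[str]:
    # Greedy packing: add paragraphs until the next one would exceed the budget.
    chunks, current, used = [], [], 0
    for para in paragraphs:
        n = len(para.split())  # whitespace tokens as a stand-in for count_tokens
        if current and used + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

pack_paragraphs(["a b c", "d e f", "g h i"], max_tokens=6)
# → ["a b c\n\nd e f", "g h i"]
```

An oversized paragraph is kept whole here rather than split, which is why the full implementation adds the sentence-splitting branch.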
Strategy 2: Structure-Aware Chunking
Principle
Respect document hierarchy and structure.
Documents have inherent structure:
- Headings and sections
- Lists and tables
- Code blocks
- Metadata
Implementation: Markdown-Aware
from markdown_it import MarkdownIt

def chunk_by_structure(markdown: str, max_tokens: int = 512) -> list[dict]:
    """
    Chunk markdown respecting structure.
    Each chunk includes its context (heading hierarchy).
    """
    md = MarkdownIt()
    tokens = md.parse(markdown)
    chunks = []
    current_section = []
    heading_context = []  # Track heading hierarchy
    in_heading = False

    def flush():
        if current_section:
            chunks.append({
                'content': '\n'.join(current_section),
                'context': heading_context.copy(),
                'level': len(heading_context)
            })
            current_section.clear()

    for token in tokens:
        if token.type == 'heading_open':
            level = int(token.tag[1])  # h1 → 1, h2 → 2, etc.
            # Finalize previous section
            flush()
            # Truncate hierarchy to the parent of this heading
            heading_context = heading_context[:level - 1]
            in_heading = True
        elif token.type == 'heading_close':
            in_heading = False
        elif token.type == 'inline':
            if in_heading:
                # Heading text becomes part of the context path
                heading_context.append(token.content)
            else:
                current_section.append(token.content)
                # Oversized section → emit and start a new one
                if count_tokens('\n'.join(current_section)) > max_tokens:
                    flush()

    flush()
    return chunks
# Example output:
# {
# 'content': 'To reset your password, click the "Forgot Password" link...',
# 'context': ['Account Management', 'Password Reset'],
# 'level': 2
# }
Augmenting Chunks with Context
def augment_chunk_with_context(chunk: dict) -> str:
    """
    Include heading hierarchy in chunk for better retrieval.
    """
    context_path = " > ".join(chunk['context'])
    return f"Section: {context_path}\n\n{chunk['content']}"
# Query: "How do I reset my password?"
# Without context: Matches generic password text
# With context: Matches "Account Management > Password Reset" section
When to use:
- Structured documentation
- Technical manuals
- Knowledge bases with clear hierarchy
- When section context improves retrieval
Strategy 3: Sliding Window with Overlap
Principle
Prevent information loss at chunk boundaries.
If relevant information spans a boundary, neither chunk is complete.
Implementation
def chunk_with_overlap(text: str, chunk_size: int = 512, overlap: int = 128):
    """
    Create overlapping chunks to preserve boundary context.
    """
    tokens = tokenize(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append({
            'content': detokenize(chunk_tokens),
            'start': start,
            'end': min(end, len(tokens))
        })
        if end >= len(tokens):
            break  # Final chunk reached; avoid emitting a duplicate tail
        # Next chunk starts `overlap` tokens before current end
        start = end - overlap
    return chunks
# Example:
# Chunk 1: tokens 0-512
# Chunk 2: tokens 384-896 (overlap of 128 tokens)
# Chunk 3: tokens 768-1280 (overlap of 128 tokens)
Trade-offs
Pros:
- Reduces boundary information loss
- Improves recall for queries spanning sections
Cons:
- Increases storage (duplicate content)
- Increases retrieval complexity (duplicate results)
- Higher embedding costs
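The storage cost is easy to quantify: each new chunk advances by `chunk_size - overlap` tokens, so each token is stored in roughly `chunk_size / (chunk_size - overlap)` chunks:

```python
def overlap_overhead(chunk_size: int, overlap: int) -> float:
    # Each chunk advances by (chunk_size - overlap) tokens, so each
    # token appears in ~chunk_size / (chunk_size - overlap) chunks.
    assert overlap < chunk_size, "overlap must be smaller than chunk_size"
    return chunk_size / (chunk_size - overlap)

overlap_overhead(512, 128)  # → ~1.33: each token stored ~1.33 times
overlap_overhead(512, 256)  # → 2.0: storage and embedding cost double
```

This is worth computing before committing: 25% overlap is a modest ~33% overhead, but 50% overlap doubles both storage and embedding cost.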
Deduplication Strategy
def deduplicate_retrieved_chunks(chunks: list[dict]) -> list[dict]:
    """
    When retrieving overlapping chunks, merge and deduplicate.
    """
    # Sort by document position
    sorted_chunks = sorted(chunks, key=lambda c: (c['doc_id'], c['start']))
    deduplicated = []
    for chunk in sorted_chunks:
        if not deduplicated:
            deduplicated.append(chunk)
            continue
        last = deduplicated[-1]
        # Check if this chunk overlaps with previous
        if chunk['doc_id'] == last['doc_id'] and chunk['start'] < last['end']:
            # Merge overlapping chunks
            last['content'] = merge_overlapping_text(
                last['content'],
                chunk['content'],
                last['end'] - chunk['start']
            )
            last['end'] = max(last['end'], chunk['end'])
        else:
            deduplicated.append(chunk)
    return deduplicated
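`merge_overlapping_text` is left undefined above. A minimal token-based sketch, assuming both chunks came from the same whitespace tokenization (a real tokenizer would require detokenizing the merged token list instead):

```python
def merge_overlapping_text(left: str, right: str, overlap_tokens: int) -> str:
    # Drop the first `overlap_tokens` tokens of the right chunk,
    # which duplicate the tail of the left chunk.
    right_tokens = right.split()
    return " ".join(left.split() + right_tokens[overlap_tokens:])

merge_overlapping_text("a b c d", "c d e f", 2)
# → "a b c d e f"
```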
When to use:
- Critical retrieval scenarios (medical, legal)
- When information frequently spans boundaries
- When storage cost is acceptable
Strategy 4: Query-Aware Chunking
Principle
Optimize chunks for how they will be queried.
Different query types need different chunking:
- How-to queries: Need complete procedures
- Definition queries: Need focused concepts
- Comparison queries: Need related items together
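Routing by query type can start as a crude keyword heuristic. The keywords below are illustrative assumptions, not a tested taxonomy; a production system might use an LLM or a trained classifier instead:

```python
def classify_query(query: str) -> str:
    # Crude keyword heuristic for routing queries to a granularity level.
    q = query.lower()
    # Check fine-grained signals first, so "what does the X parameter mean"
    # is not mistaken for a summary query.
    if any(kw in q for kw in ("parameter", "field", "option", "flag")):
        return "paragraphs"
    if any(kw in q for kw in ("overview", "summary", "what is", "what does")):
        return "summary"
    return "sections"
```

On the example queries below, this routes "What does this API do?" to the summary index, "How do I authenticate?" to sections, and "What does the timeout parameter mean?" to paragraphs.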
Implementation: Multi-Granularity Chunking
class MultiGranularityChunker:
    """
    Create chunks at multiple levels of granularity.
    Retrieve at the granularity matching the query.
    """
    def chunk_document(self, doc: str) -> dict:
        return {
            'summary': self.extract_summary(doc),        # High-level
            'sections': self.chunk_by_sections(doc),     # Medium-level
            'paragraphs': self.chunk_by_paragraphs(doc)  # Fine-grained
        }

    def retrieve(self, query: str):
        # Route the query to the index matching its granularity
        if self.is_summary_query(query):
            return self.search(query, index='summary')
        elif self.is_detailed_query(query):
            return self.search(query, index='paragraphs')
        else:
            return self.search(query, index='sections')
# Example queries:
# "What does this API do?" → Summary level
# "How do I authenticate?" → Section level
# "What does the timeout parameter mean?" → Paragraph level
Parent-Child Chunking
class HierarchicalChunker:
    """
    Embed small chunks, but retrieve larger parent context.
    """
    def create_chunks(self, doc: str):
        sections = split_by_sections(doc)
        chunks = []
        for section_id, section in enumerate(sections):
            paragraphs = split_by_paragraphs(section)
            for para_id, para in enumerate(paragraphs):
                chunks.append({
                    'id': f"{section_id}:{para_id}",
                    'content': para,    # Small chunk for embedding
                    'parent': section,  # Full section context
                    'metadata': {
                        'section_id': section_id,
                        'section_title': get_section_title(section)
                    }
                })
        return chunks

    def retrieve_with_context(self, query: str, top_k: int = 5):
        # Search at paragraph level (precise matching)
        matches = self.vector_search(query, top_k)
        # Return parent sections (full context)
        results = []
        for match in matches:
            results.append({
                'content': match['parent'],  # Full section, not just paragraph
                'relevance_score': match['score'],
                'metadata': match['metadata']
            })
        return results
When to use:
- Complex documents with nested structure
- When precise matching but broad context needed
- When query patterns are predictable
Strategy 5: Domain-Specific Chunking
Code Documentation
def chunk_code_docs(doc: str) -> list[dict]:
    """
    Chunk technical documentation by code elements.
    """
    chunks = []
    # Each function/class is a chunk
    elements = extract_code_elements(doc)  # Functions, classes, etc.
    for element in elements:
        chunks.append({
            'type': element['type'],  # 'function', 'class', etc.
            'name': element['name'],
            'signature': element['signature'],
            'description': element['description'],
            'examples': element['examples'],
            'parameters': element['parameters'],
            'returns': element['returns']
        })
    return chunks
# Query: "How do I use the authenticate() function?"
# Retrieves: Complete function documentation including signature, params, examples
Conversational Data
def chunk_conversations(messages: list[dict]) -> list[dict]:
    """
    Chunk chat logs by conversation turns or topics.
    """
    chunks = []
    current_topic_messages = []
    for msg in messages:
        # Detect topic changes
        if detect_topic_change(current_topic_messages, msg):
            if current_topic_messages:
                chunks.append({
                    'messages': current_topic_messages,
                    'summary': summarize_conversation(current_topic_messages),
                    'participants': get_participants(current_topic_messages)
                })
            current_topic_messages = [msg]
        else:
            current_topic_messages.append(msg)
    if current_topic_messages:
        chunks.append({
            'messages': current_topic_messages,
            'summary': summarize_conversation(current_topic_messages),
            'participants': get_participants(current_topic_messages)
        })
    return chunks
Tabular Data
import pandas as pd

def chunk_tables(table: pd.DataFrame, table_name: str) -> list[dict]:
    """
    Chunk tables by rows or semantic groupings.
    """
    chunks = []
    # Strategy: each row as a chunk (for small tables)
    for idx, row in table.iterrows():
        chunks.append({
            'type': 'table_row',
            'table_name': table_name,
            'columns': table.columns.tolist(),
            'values': row.to_dict(),
            'text': format_row_as_text(row, table.columns)
        })
    return chunks
# Query: "What is John's email address?"
# Retrieves: Row where name='John', including email column
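`format_row_as_text` is left undefined above. One minimal sketch serializes each row as `column: value` pairs, which tends to embed and match better than raw cell values (shown here on a plain dict for simplicity; a pandas Series supports the same `row[col]` access):

```python
def format_row_as_text(row: dict, columns: list[str]) -> str:
    # Serialize a row as "column: value" pairs for embedding.
    return "; ".join(f"{col}: {row[col]}" for col in columns)

format_row_as_text({"name": "John", "email": "john@example.com"},
                   ["name", "email"])
# → "name: John; email: john@example.com"
```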
When to use:
- Specialized content types
- When generic chunking loses critical structure
- When domain knowledge improves retrieval
Evaluating Chunking Quality
Metric 1: Retrieval Precision
def evaluate_retrieval_precision(test_cases: list[dict], chunking_strategy):
    """
    What percentage of retrieved chunks are relevant?
    """
    results = []
    for case in test_cases:
        chunks = chunking_strategy(case['document'])
        retrieved = retrieve(case['query'], chunks, top_k=5)
        relevant_count = sum(
            1 for chunk in retrieved
            if is_relevant(chunk, case['expected_content'])
        )
        precision = relevant_count / len(retrieved)
        results.append(precision)
    return mean(results)
Metric 2: Answer Completeness
def evaluate_answer_completeness(test_cases, chunking_strategy):
    """
    Do retrieved chunks contain all information needed to answer?
    """
    results = []
    for case in test_cases:
        chunks = chunking_strategy(case['document'])
        retrieved = retrieve(case['query'], chunks, top_k=5)
        # Generate answer from retrieved chunks
        answer = generate_answer(case['query'], retrieved)
        # Check if answer contains expected information
        completeness = check_completeness(answer, case['expected_answer'])
        results.append(completeness)
    return mean(results)
Metric 3: Chunk Coherence
def evaluate_chunk_coherence(chunks: list[str]) -> float:
    """
    Do chunks represent coherent semantic units?
    """
    coherence_scores = []
    for chunk in chunks:
        # Measure semantic coherence
        sentences = split_into_sentences(chunk)
        if len(sentences) < 2:
            coherence_scores.append(1.0)
            continue
        # Compare adjacent sentence embeddings
        embeddings = [embed(sent) for sent in sentences]
        similarities = []
        for i in range(len(embeddings) - 1):
            sim = cosine_similarity(embeddings[i], embeddings[i + 1])
            similarities.append(sim)
        coherence_scores.append(mean(similarities))
    return mean(coherence_scores)
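`embed` is model-specific, but `cosine_similarity` is standard: the dot product of two vectors divided by the product of their norms. A dependency-free sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product of the vectors divided by the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1.0, 0.0], [1.0, 0.0])  # identical direction → 1.0
cosine_similarity([1.0, 0.0], [0.0, 1.0])  # orthogonal → 0.0
```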
Practical Implementation Guide
Step 1: Analyze Your Data
def analyze_document_structure(docs: list[str]):
    """
    Understand document characteristics before choosing strategy.
    """
    analysis = {
        'avg_length': mean([len(doc) for doc in docs]),
        'has_structure': check_for_headings(docs),
        'content_type': detect_content_type(docs),  # prose, code, tables
        'typical_queries': analyze_query_patterns()
    }
    return analysis
Step 2: Choose Base Strategy
def select_chunking_strategy(analysis: dict):
    if analysis['has_structure']:
        return structure_aware_chunking
    elif analysis['content_type'] == 'code':
        return code_specific_chunking
    elif analysis['content_type'] == 'conversation':
        return conversation_chunking
    else:
        return semantic_chunking
Step 3: Test and Iterate
# Compare strategies empirically
strategies = [
    ('naive', chunk_by_tokens),
    ('semantic', chunk_by_paragraphs),
    ('structure', chunk_by_structure),
    ('overlap', lambda d: chunk_with_overlap(d, overlap=128))
]
results = []
for name, strategy in strategies:
    precision = evaluate_retrieval_precision(test_cases, strategy)
    completeness = evaluate_answer_completeness(test_cases, strategy)
    results.append({
        'strategy': name,
        'precision': precision,
        'completeness': completeness
    })
best_strategy = max(results, key=lambda r: r['precision'] + r['completeness'])
Common Pitfalls
Pitfall 1: Optimizing for Chunk Count
# Wrong: Trying to minimize number of chunks
# Right: Optimizing for retrieval quality
Pitfall 2: One-Size-Fits-All
# Wrong: Same chunking for all document types
# Right: Strategy matched to content type
Pitfall 3: Ignoring Query Patterns
# Wrong: Chunking without understanding queries
# Right: Analyze queries, optimize chunks accordingly
Pitfall 4: No Evaluation
# Wrong: Choose strategy based on intuition
# Right: Measure retrieval quality empirically
Conclusion
Chunking is not text splitting. It is information architecture.
Effective chunking requires:
- Understanding content structure: Respect document hierarchy
- Preserving semantic units: Avoid splitting coherent information
- Matching query patterns: Optimize for how content will be queried
- Empirical validation: Measure retrieval quality, not chunk count
Start with semantic or structure-aware chunking. Add complexity (overlap, multi-granularity, domain-specific) only when justified by measurement.
The best chunking strategy is the one that surfaces relevant information when users ask questions.