Retrieval Is the Hard Part
Most RAG failures stem from poor retrieval, not weak models. This article explains why retrieval is difficult, how to improve it, and how to debug retrieval failures systematically.
The Illusion of Simple Similarity Search
Many engineers think RAG works like this:
# Seems simple
def rag(query: str) -> str:
    results = vector_db.search(query, top_k=5)
    docs = [r.content for r in results]
    return llm.generate(f"Context: {docs}\n\nQuestion: {query}")
This rarely works in production.
The reality:
- Embeddings capture semantic similarity, not relevance
- Top-k results often miss critical information
- Queries and documents live in different semantic spaces
- No single retrieval strategy works for all queries
Most RAG failures happen at retrieval, not generation.
Why Retrieval Is Hard
Problem 1: Semantic Similarity ≠ Relevance
# Query
query = "How do I reset my password?"
# High similarity, low relevance
doc_1 = "You can reset your username by contacting support."
# Contains: reset, contact support
# Missing: password reset procedure
# Lower similarity, high relevance
doc_2 = "To reset password: click Settings > Security > Reset Password"
# Different wording, but exactly what user needs
Embedding models optimize for semantic similarity, not task relevance.
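A quick way to see this in practice is to embed the query and both documents and compare cosine similarities yourself. The sketch below uses sentence-transformers with an example model name; exact scores vary by model, but wording overlap often outweighs task relevance.

# Sketch: compare raw embedding similarity against actual relevance.
# Model name is just an example; scores differ across models.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How do I reset my password?"
docs = [
    "You can reset your username by contacting support.",
    "To reset password: click Settings > Security > Reset Password",
]

query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

for doc, score in zip(docs, util.cos_sim(query_emb, doc_embs)[0]):
    print(f"{float(score):.3f}  {doc}")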
Problem 2: Query-Document Mismatch
# User query: Short, informal
query = "api broken"
# Document: Long, formal
doc = """
API Service Status and Troubleshooting Guide
This document describes common issues encountered when
integrating with our REST API, including authentication
failures, rate limiting, timeout errors, and connectivity
problems.
"""
# Poor embedding match despite high relevance
**Queries and documents differ in:**
- Length (5 words vs 500 words)
- Style (casual vs formal)
- Vocabulary (user terms vs technical jargon)
Problem 3: Multi-Intent Queries
# Query with multiple information needs
query = "What features are included in Pro plan and how much does it cost?"
# Relevant documents are scattered
doc_1 = "Pro plan includes: advanced analytics, priority support..."
doc_2 = "Pricing: Free $0/mo, Pro $49/mo, Enterprise custom"
doc_3 = "Feature comparison: Free vs Pro vs Enterprise"
# Single vector search might miss some
Complex queries require information from multiple sources.
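One way to handle this, sketched below under the same pseudocode conventions as the rest of this article (a global `llm` with a `generate` method and a `vector_db` with a `search` method), is to decompose the query into sub-questions, retrieve for each one, and merge the results.

def multi_query_retrieve(query: str, top_k: int = 5) -> list[dict]:
    """
    Sketch: split a multi-intent query into sub-questions,
    retrieve for each one, and deduplicate the merged results.
    """
    decompose_prompt = f"""
    Break this question into independent sub-questions, one per line:
    {query}
    """
    sub_queries = [
        line.strip() for line in llm.generate(decompose_prompt).splitlines()
        if line.strip()
    ]

    seen_ids, merged = set(), []
    for sub_query in sub_queries or [query]:
        for doc in vector_db.search(query=sub_query, top_k=top_k):
            if doc['id'] not in seen_ids:
                seen_ids.add(doc['id'])
                merged.append(doc)
    return merged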
Problem 4: Context-Dependent Relevance
# Same query, different contexts
query = "How do I deploy?"
# Context 1: First-time user
relevant_doc_1 = "Deployment quick start guide"
# Context 2: Experienced developer
relevant_doc_2 = "Advanced deployment configurations and CI/CD integration"
# Context 3: Troubleshooting
relevant_doc_3 = "Common deployment errors and solutions"
# Relevance depends on user state, not just query
Retrieval Strategy 1: Hybrid Search
Combining Dense and Sparse Retrieval
class HybridRetriever:
"""
Combine semantic search (dense) with keyword search (sparse).
"""
def __init__(self, vector_db, keyword_index):
self.vector_db = vector_db
self.keyword_index = keyword_index # BM25, Elasticsearch
def retrieve(self, query: str, top_k: int = 10) -> list[dict]:
# Dense retrieval: Semantic similarity
dense_results = self.vector_db.search(
query=query,
top_k=top_k
)
# Sparse retrieval: Keyword matching
sparse_results = self.keyword_index.search(
query=query,
top_k=top_k
)
# Combine with weighted score
combined = self.merge_results(
dense_results,
sparse_results,
dense_weight=0.7,
sparse_weight=0.3
)
return combined[:top_k]
def merge_results(self, dense, sparse, dense_weight, sparse_weight):
"""
Reciprocal Rank Fusion for combining rankings.
"""
scores = {}
# Score from dense retrieval
for rank, doc in enumerate(dense):
doc_id = doc['id']
scores[doc_id] = scores.get(doc_id, 0) + dense_weight / (rank + 1)
# Score from sparse retrieval
for rank, doc in enumerate(sparse):
doc_id = doc['id']
scores[doc_id] = scores.get(doc_id, 0) + sparse_weight / (rank + 1)
# Sort by combined score
ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
return [{'id': doc_id, 'score': score} for doc_id, score in ranked]
Why this helps:
- Dense: Captures semantic meaning
- Sparse: Matches exact terms (product names, error codes)
- Combination: More robust across query types
When to Use Hybrid Search
# Queries that benefit from keyword matching
keyword_heavy = [
"error code E404", # Exact code match
"Python 3.11 support", # Specific version
"JWT authentication", # Technical term
]
# Queries that benefit from semantic search
semantic_heavy = [
"How do I get started?", # Conceptual
"Why is my request slow?", # No exact keywords
"Best practices for security" # Abstract concept
]
# Hybrid works for both
Retrieval Strategy 2: Query Expansion
Problem: Users Don’t Know the Right Terms
# User query
query = "How do I login?"
# Documentation uses
actual_term = "authentication"
# Semantic similarity may be weak
Solution: Expand Query with Synonyms
class QueryExpander:
"""
Expand query to include related terms.
"""
def expand(self, query: str) -> str:
# Use LLM to generate related terms
expansion_prompt = f"""
Given this search query, list related technical terms and synonyms.
Query: {query}
Related terms (comma-separated):
"""
related_terms = llm.generate(expansion_prompt)
# Combine original + expanded
expanded = f"{query} {related_terms}"
return expanded
# Example
original = "login issues"
expanded = "login issues authentication sign-in auth errors credentials"
# Broader semantic coverage
Hypothetical Document Embedding (HyDE)
class HyDERetriever:
    """
    Generate a hypothetical answer, then search for similar documents.
    """
    def __init__(self, vector_db):
        self.vector_db = vector_db

def retrieve(self, query: str, top_k: int = 5):
# Step 1: Generate hypothetical answer
prompt = f"""
Answer this question with a detailed technical response:
{query}
"""
hypothetical_doc = llm.generate(prompt)
# Step 2: Search using hypothetical answer
# (More likely to match actual documents)
results = self.vector_db.search(
query=hypothetical_doc,
top_k=top_k
)
return results
# Why this works:
# Query: "How do I reset password?"
# Hypothetical: "To reset your password, navigate to Settings, click Security, then Reset Password..."
# This matches documentation better than the short query
Retrieval Strategy 3: Metadata Filtering
Adding Structure to Vector Search
class MetadataFilteredRetriever:
    """
    Filter by metadata before or after vector search.
    """
    def __init__(self, vector_db):
        self.vector_db = vector_db

def retrieve(self, query: str, filters: dict = None, top_k: int = 5):
# Pre-filter by metadata
if filters:
candidate_docs = self.filter_by_metadata(filters)
else:
candidate_docs = None
# Vector search within filtered subset
results = self.vector_db.search(
query=query,
filter=candidate_docs,
top_k=top_k
)
return results
def filter_by_metadata(self, filters: dict):
"""
Example filters:
{
'category': 'api-documentation',
'version': '2.0',
'language': 'python',
'last_updated': {'$gte': '2025-01-01'}
}
"""
return self.vector_db.filter(filters)
# Usage
retriever.retrieve(
query="How do I authenticate?",
filters={
'category': 'authentication',
'version': 'latest'
}
)
Why metadata matters:
- Scope search to relevant subset
- Filter by date (get current docs)
- Filter by type (code vs prose)
- Filter by permission level (user access)
Enriching Chunks with Metadata
def create_enriched_chunks(doc: dict) -> list[dict]:
    """
    Add metadata that improves filtering.
    Assumes chunk_document returns dicts with 'text' and 'section' keys.
    """
    chunks = chunk_document(doc['content'])
    enriched = []
    for chunk in chunks:
        text = chunk['text']
        enriched.append({
            'content': text,
            'metadata': {
                'doc_id': doc['id'],
                'title': doc['title'],
                'section': chunk.get('section'),
                'category': doc['category'],
                'tags': doc['tags'],
                'version': doc['version'],
                'last_updated': doc['last_updated'],
                'author': doc['author'],
                'content_type': detect_content_type(text),
                'language': detect_language(text),
                'word_count': len(text.split()),
                'has_code': contains_code_block(text)
            }
        })
    return enriched
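The helpers referenced above (chunk_document, detect_language, and so on) are left undefined; two of them might look roughly like the heuristic sketches below. These are illustrative placeholders, not production classifiers.

# Hypothetical versions of two helpers referenced above; simple heuristics only.
def contains_code_block(text: str) -> bool:
    """Crude check: fenced code or heavily indented lines."""
    return "```" in text or any(
        line.startswith(("    ", "\t")) for line in text.splitlines()
    )

def detect_content_type(text: str) -> str:
    """Label a chunk as 'code', 'table', or 'prose'."""
    if contains_code_block(text):
        return "code"
    if text.count("|") > 4:  # rough signal for markdown-style tables
        return "table"
    return "prose"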
Retrieval Strategy 4: Reranking
Two-Stage Retrieval
class RerankedRetriever:
"""
Stage 1: Fast, broad retrieval
Stage 2: Expensive, accurate reranking
"""
def __init__(self, vector_db, reranker):
self.vector_db = vector_db
self.reranker = reranker # Cross-encoder or LLM
def retrieve(self, query: str, top_k: int = 5):
# Stage 1: Retrieve more candidates (cheap)
candidates = self.vector_db.search(
query=query,
top_k=top_k * 5 # Over-retrieve
)
# Stage 2: Rerank candidates (expensive)
reranked = self.reranker.rank(
query=query,
documents=candidates
)
return reranked[:top_k]
# Why this works:
# - Vector search: Fast but approximate
# - Reranker: Slow but precise
# - Over-retrieve + rerank = best of both
Cross-Encoder Reranking
from sentence_transformers import CrossEncoder
class CrossEncoderReranker:
"""
Score query-document pairs jointly.
More accurate than separate embeddings.
"""
def __init__(self, model_name: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2'):
self.model = CrossEncoder(model_name)
def rank(self, query: str, documents: list[dict]) -> list[dict]:
# Create query-document pairs
pairs = [[query, doc['content']] for doc in documents]
# Score all pairs
scores = self.model.predict(pairs)
# Sort by score
for doc, score in zip(documents, scores):
doc['rerank_score'] = score
ranked = sorted(documents, key=lambda d: d['rerank_score'], reverse=True)
return ranked
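Wiring this into the two-stage retriever above is straightforward; a minimal usage sketch, assuming an already-initialized `vector_db` client:

# Minimal usage sketch: plug the cross-encoder into RerankedRetriever.
reranker = CrossEncoderReranker()
retriever = RerankedRetriever(vector_db, reranker)

results = retriever.retrieve("How do I rotate an API key?", top_k=5)
for doc in results:
    print(round(float(doc['rerank_score']), 3), doc['content'][:80])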
LLM-Based Reranking
import json

class LLMReranker:
    """
    Use an LLM to judge relevance.
    Most expensive but most accurate.
    """
    def format_docs(self, documents: list[dict]) -> str:
        # One reasonable format: prefix each candidate with its ID
        # so the model can reference it in the ranking.
        return "\n".join(f"[{doc['id']}] {doc['content']}" for doc in documents)

    def rank(self, query: str, documents: list[dict], top_k: int = 5):
        ranking_prompt = f"""
        Query: {query}
        Documents:
        {self.format_docs(documents)}
        Task: Rank these documents by relevance to the query.
        Output only document IDs in order, most relevant first.
        Format: ["id1", "id2", "id3", ...]
        """
        ranked_ids = json.loads(llm.generate(ranking_prompt))
        # Reorder documents, skipping any IDs the model invented
        doc_map = {doc['id']: doc for doc in documents}
        return [doc_map[doc_id] for doc_id in ranked_ids[:top_k] if doc_id in doc_map]
Retrieval Strategy 5: Contextual Retrieval
Using Conversation History
class ContextualRetriever:
"""
Use conversation context to improve retrieval.
"""
def __init__(self, vector_db):
self.vector_db = vector_db
self.conversation_history = []
def retrieve(self, query: str, top_k: int = 5):
# Rewrite query with context
contextual_query = self.rewrite_with_context(
query,
self.conversation_history
)
# Search with contextual query
results = self.vector_db.search(
query=contextual_query,
top_k=top_k
)
# Update history
self.conversation_history.append({
'query': query,
'results': [r['id'] for r in results]
})
return results
    def rewrite_with_context(self, query: str, history: list):
        if not history:
            return query
        # Use only the last 3 turns of history
        prompt = f"""
        Conversation history:
        {self.format_history(history[-3:])}
        Current query: {query}
        Rewrite the current query to be self-contained,
        incorporating relevant context from history.
        Rewritten query:
        """
        return llm.generate(prompt)
# Example:
# User: "How do I create an API key?"
# Bot: [Explains API key creation]
# User: "Where do I use it?" # Ambiguous
# Rewritten: "Where do I use the API key in my requests?"
Debugging Retrieval Failures
Step 1: Measure Retrieval Quality
from statistics import mean

class RetrievalEvaluator:
    """
    Quantify retrieval performance.
    """
    def __init__(self, retriever):
        self.retriever = retriever

def evaluate(self, test_cases: list[dict]) -> dict:
metrics = {
'precision': [],
'recall': [],
'mrr': [] # Mean Reciprocal Rank
}
for case in test_cases:
results = self.retriever.retrieve(case['query'], top_k=10)
relevant_ids = set(case['relevant_doc_ids'])
retrieved_ids = [r['id'] for r in results]
# Precision: What fraction of retrieved docs are relevant?
relevant_retrieved = set(retrieved_ids) & relevant_ids
precision = len(relevant_retrieved) / len(retrieved_ids)
metrics['precision'].append(precision)
# Recall: What fraction of relevant docs were retrieved?
recall = len(relevant_retrieved) / len(relevant_ids)
metrics['recall'].append(recall)
# MRR: Position of first relevant doc
for rank, doc_id in enumerate(retrieved_ids, 1):
if doc_id in relevant_ids:
metrics['mrr'].append(1 / rank)
break
else:
metrics['mrr'].append(0)
return {
'precision': mean(metrics['precision']),
'recall': mean(metrics['recall']),
'mrr': mean(metrics['mrr'])
}
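Each test case pairs a query with the IDs of documents a human judged relevant. A hypothetical example of the expected shape:

# Hypothetical labeled test cases; doc IDs are placeholders.
test_cases = [
    {
        'query': 'How do I reset my password?',
        'relevant_doc_ids': ['doc-security-12', 'doc-account-03'],
    },
    {
        'query': 'api broken',
        'relevant_doc_ids': ['doc-api-status-01'],
    },
]

evaluator = RetrievalEvaluator(retriever)  # any retriever from this article
print(evaluator.evaluate(test_cases))
# -> {'precision': ..., 'recall': ..., 'mrr': ...}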
Step 2: Identify Failure Patterns
def analyze_failures(test_cases: list[dict], retriever):
"""
Categorize retrieval failures.
"""
    failure_types = {
        'missing_relevant': [],   # Relevant docs not retrieved at all
        'low_ranking': [],        # Relevant docs ranked too low
        'irrelevant_noise': [],   # Too many irrelevant docs in the top results
        'query_mismatch': []      # Query-document vocabulary gap (flagged via manual review)
    }
for case in test_cases:
results = retriever.retrieve(case['query'], top_k=10)
relevant_ids = set(case['relevant_doc_ids'])
retrieved_ids = [r['id'] for r in results]
# Missing relevant docs?
missing = relevant_ids - set(retrieved_ids)
if missing:
failure_types['missing_relevant'].append({
'query': case['query'],
'missing_docs': missing
})
# Relevant docs ranked low?
for doc_id in relevant_ids:
if doc_id in retrieved_ids:
rank = retrieved_ids.index(doc_id) + 1
if rank > 5: # Should be in top 5
failure_types['low_ranking'].append({
'query': case['query'],
'doc_id': doc_id,
'rank': rank
})
# Too many irrelevant docs in top results?
top_5 = set(retrieved_ids[:5])
irrelevant_in_top = top_5 - relevant_ids
if len(irrelevant_in_top) > 3:
failure_types['irrelevant_noise'].append({
'query': case['query'],
'irrelevant_count': len(irrelevant_in_top)
})
return failure_types
Step 3: Test Retrieval Strategies
def compare_strategies(test_cases: list[dict]):
"""
Compare different retrieval approaches.
"""
strategies = {
'baseline': VectorRetriever(vector_db),
'hybrid': HybridRetriever(vector_db, keyword_index),
'reranked': RerankedRetriever(vector_db, reranker),
'contextual': ContextualRetriever(vector_db)
}
results = {}
for name, strategy in strategies.items():
evaluator = RetrievalEvaluator(strategy)
metrics = evaluator.evaluate(test_cases)
results[name] = metrics
# Compare
print("Retrieval Strategy Comparison:")
for name, metrics in results.items():
print(f"{name}:")
print(f" Precision: {metrics['precision']:.3f}")
print(f" Recall: {metrics['recall']:.3f}")
print(f" MRR: {metrics['mrr']:.3f}")
return results
Production Retrieval Pipeline
End-to-End Implementation
import time
from datetime import datetime

class ProductionRetriever:
    """
    Production-grade retrieval with fallbacks and monitoring.
    """
def __init__(self, config: dict):
self.vector_db = VectorDB(config['vector_db'])
self.keyword_index = KeywordIndex(config['keyword_index'])
self.reranker = CrossEncoderReranker(config['reranker_model'])
self.metadata_filters = config.get('metadata_filters', {})
def retrieve(
self,
query: str,
top_k: int = 5,
filters: dict = None,
context: list = None
) -> list[dict]:
"""
Multi-stage retrieval with monitoring.
"""
start_time = time.time()
try:
# Stage 1: Query preprocessing
processed_query = self.preprocess_query(query, context)
# Stage 2: Hybrid retrieval (over-retrieve)
candidates = self.hybrid_search(
processed_query,
top_k=top_k * 5,
filters=filters
)
if not candidates:
# Fallback: Relax filters
candidates = self.hybrid_search(
processed_query,
top_k=top_k * 5,
filters=None
)
# Stage 3: Rerank
reranked = self.reranker.rank(processed_query, candidates)
# Stage 4: Post-processing
results = self.postprocess_results(reranked[:top_k])
# Log metrics
self.log_retrieval(
query=query,
results=results,
latency_ms=(time.time() - start_time) * 1000
)
return results
except Exception as e:
self.log_error(query, e)
return self.fallback_retrieval(query, top_k)
def hybrid_search(self, query: str, top_k: int, filters: dict):
# Vector search
dense = self.vector_db.search(query, top_k, filters)
# Keyword search
sparse = self.keyword_index.search(query, top_k, filters)
# Merge
return self.merge_results(dense, sparse)
def fallback_retrieval(self, query: str, top_k: int):
"""
Simple fallback when primary retrieval fails.
"""
return self.keyword_index.search(query, top_k)
def log_retrieval(self, query: str, results: list, latency_ms: float):
"""
Track retrieval metrics for monitoring.
"""
log_event({
'type': 'retrieval',
'query': query,
'num_results': len(results),
'latency_ms': latency_ms,
'top_score': results[0]['score'] if results else 0,
'timestamp': datetime.now()
})
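A few helpers referenced above are omitted for brevity: merge_results can reuse the weighted reciprocal-rank fusion from HybridRetriever, log_error can wrap the same log_event call as log_retrieval, and the remaining two might look roughly like the sketches below (shown as plain functions, with hypothetical logic).

# Hypothetical sketches of two ProductionRetriever helpers.
def preprocess_query(query: str, context: list | None = None) -> str:
    """Normalize whitespace; a fuller version would also rewrite the
    query with conversation context, as in ContextualRetriever."""
    return " ".join(query.split())

def postprocess_results(results: list[dict]) -> list[dict]:
    """Drop duplicate chunks that come from the same source document."""
    seen, deduped = set(), []
    for result in results:
        key = result.get('doc_id', result.get('id'))
        if key not in seen:
            seen.add(key)
            deduped.append(result)
    return deduped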
Conclusion
Retrieval is where RAG succeeds or fails.
Key principles:
- Embeddings alone are insufficient: Combine dense and sparse search
- One-stage retrieval is brittle: Use multi-stage pipelines
- Context matters: Rewrite queries, use metadata, incorporate history
- Measure continuously: Track precision, recall, and ranking quality
- Have fallbacks: Retrieval will fail; handle it gracefully
Most engineering effort in RAG should focus on retrieval, not generation. The model can only work with what retrieval gives it.
Get retrieval right, and RAG works. Get retrieval wrong, and no amount of prompt engineering will save it.
Continue learning
Next in this path
Evaluating RAG Quality: Precision, Recall, and Faithfulness
Without evaluation, RAG systems cannot improve reliably. This article introduces practical metrics and evaluation strategies for measuring retrieval accuracy, answer grounding, and regression over time.