Retrieval Is the Hard Part

Most RAG failures stem from poor retrieval, not weak models. This article explains why retrieval is difficult, how to improve it, and how to debug retrieval failures systematically.

Level: intermediate · Topics: RAG · Tags: rag, retrieval, search, ranking, production

Many engineers think RAG works like this:

# Seems simple
def rag(query: str) -> str:
    results = vector_db.search(query, top_k=5)
    docs = [r.content for r in results]
    return llm.generate(f"Context: {docs}\n\nQuestion: {query}")

This rarely works in production.

The reality:

  • Embeddings capture semantic similarity, not relevance
  • Top-k results often miss critical information
  • Queries and documents live in different semantic spaces
  • No single retrieval strategy works for all queries

Most RAG failures happen at retrieval, not generation.


Why Retrieval Is Hard

Problem 1: Semantic Similarity ≠ Relevance

# Query
query = "How do I reset my password?"

# High similarity, low relevance
doc_1 = "You can reset your username by contacting support."
# Contains: reset, contact support
# Missing: password reset procedure

# Lower similarity, high relevance
doc_2 = "To reset password: click Settings > Security > Reset Password"
# Different wording, but exactly what user needs

Embedding models optimize for semantic similarity, not task relevance.
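
You can see the gap for yourself by scoring the example above with an off-the-shelf embedding model. A minimal sketch using sentence-transformers; the model name is illustrative and exact scores will vary by model:

from sentence_transformers import SentenceTransformer, util

# Illustrative model choice; any sentence-embedding model shows the same effect
model = SentenceTransformer('all-MiniLM-L6-v2')

query = "How do I reset my password?"
docs = [
    "You can reset your username by contacting support.",
    "To reset password: click Settings > Security > Reset Password",
]

# Cosine similarity between the query and each document embedding
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_embs)[0].tolist()

for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")

# A high score only means "similar wording", not "answers the question"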

Problem 2: Query-Document Mismatch

# User query: Short, informal
query = "api broken"

# Document: Long, formal
doc = """
API Service Status and Troubleshooting Guide

This document describes common issues encountered when
integrating with our REST API, including authentication
failures, rate limiting, timeout errors, and connectivity
problems.
"""

# Poor embedding match despite high relevance

Queries and documents have different:

  • Length (5 words vs 500 words)
  • Style (casual vs formal)
  • Vocabulary (user terms vs technical jargon)

Problem 3: Multi-Intent Queries

# Query with multiple information needs
query = "What features are included in Pro plan and how much does it cost?"

# Relevant documents are scattered
doc_1 = "Pro plan includes: advanced analytics, priority support..."
doc_2 = "Pricing: Free $0/mo, Pro $49/mo, Enterprise custom"
doc_3 = "Feature comparison: Free vs Pro vs Enterprise"

# Single vector search might miss some

Complex queries require information from multiple sources.
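
One mitigation is to retrieve per sub-question and merge the results. A minimal sketch, reusing the vector_db interface from the surrounding examples; the sub-question split is hard-coded here for illustration, though in practice an LLM or rule-based splitter would produce it:

# Search once per sub-question, then merge and deduplicate by doc ID
sub_queries = [
    "What features are included in the Pro plan?",
    "How much does the Pro plan cost?",
]

seen, merged = set(), []
for sub_query in sub_queries:
    for doc in vector_db.search(query=sub_query, top_k=5):
        if doc['id'] not in seen:
            seen.add(doc['id'])
            merged.append(doc)

# merged now covers both information needs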

Problem 4: Context-Dependent Relevance

# Same query, different contexts
query = "How do I deploy?"

# Context 1: First-time user
relevant_doc_1 = "Deployment quick start guide"

# Context 2: Experienced developer
relevant_doc_2 = "Advanced deployment configurations and CI/CD integration"

# Context 3: Troubleshooting
relevant_doc_3 = "Common deployment errors and solutions"

# Relevance depends on user state, not just query
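
One way to act on this, anticipating the metadata filtering strategy below, is to pass known user state into the retrieval call. A minimal sketch; the user_state dict and the 'audience' metadata field are hypothetical:

# Hypothetical: scope retrieval by known user state via a metadata filter
user_state = {'audience': 'first_time'}   # e.g. derived from account age

results = vector_db.search(
    query="How do I deploy?",
    filter={'audience': user_state['audience']},  # hypothetical metadata field
    top_k=5
)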

Retrieval Strategy 1: Combining Dense and Sparse Retrieval

class HybridRetriever:
    """
    Combine semantic search (dense) with keyword search (sparse).
    """

    def __init__(self, vector_db, keyword_index):
        self.vector_db = vector_db
        self.keyword_index = keyword_index  # BM25, Elasticsearch

    def retrieve(self, query: str, top_k: int = 10) -> list[dict]:
        # Dense retrieval: Semantic similarity
        dense_results = self.vector_db.search(
            query=query,
            top_k=top_k
        )

        # Sparse retrieval: Keyword matching
        sparse_results = self.keyword_index.search(
            query=query,
            top_k=top_k
        )

        # Combine with weighted score
        combined = self.merge_results(
            dense_results,
            sparse_results,
            dense_weight=0.7,
            sparse_weight=0.3
        )

        return combined[:top_k]

    def merge_results(self, dense, sparse, dense_weight, sparse_weight):
        """
        Reciprocal Rank Fusion for combining rankings.
        """
        scores = {}

        # Score from dense retrieval
        for rank, doc in enumerate(dense):
            doc_id = doc['id']
            scores[doc_id] = scores.get(doc_id, 0) + dense_weight / (rank + 1)

        # Score from sparse retrieval
        for rank, doc in enumerate(sparse):
            doc_id = doc['id']
            scores[doc_id] = scores.get(doc_id, 0) + sparse_weight / (rank + 1)

        # Sort by combined score
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        return [{'id': doc_id, 'score': score} for doc_id, score in ranked]

Why this helps:

  • Dense: Captures semantic meaning
  • Sparse: Matches exact terms (product names, error codes)
  • Combination: More robust across query types

# Queries that benefit from keyword matching
keyword_heavy = [
    "error code E404",           # Exact code match
    "Python 3.11 support",       # Specific version
    "JWT authentication",        # Technical term
]

# Queries that benefit from semantic search
semantic_heavy = [
    "How do I get started?",     # Conceptual
    "Why is my request slow?",   # No exact keywords
    "Best practices for security" # Abstract concept
]

# Hybrid works for both

Retrieval Strategy 2: Query Expansion

Problem: Users Don’t Know the Right Terms

# User query
query = "How do I login?"

# Documentation uses
actual_term = "authentication"

# Semantic similarity may be weak

Solution: Expand Query with Synonyms

class QueryExpander:
    """
    Expand query to include related terms.
    """

    def expand(self, query: str) -> str:
        # Use LLM to generate related terms
        expansion_prompt = f"""
        Given this search query, list related technical terms and synonyms.

        Query: {query}

        Related terms (comma-separated):
        """

        related_terms = llm.generate(expansion_prompt)

        # Combine original + expanded
        expanded = f"{query} {related_terms}"
        return expanded

# Example
original = "login issues"
expanded = "login issues authentication sign-in auth errors credentials"

# Broader semantic coverage

Hypothetical Document Embeddings (HyDE)

class HyDERetriever:
    """
    Generate hypothetical answer, then search for similar documents.
    """

    def __init__(self, vector_db):
        self.vector_db = vector_db

    def retrieve(self, query: str, top_k: int = 5):
        # Step 1: Generate hypothetical answer
        prompt = f"""
        Answer this question with a detailed technical response:
        {query}
        """
        hypothetical_doc = llm.generate(prompt)

        # Step 2: Search using hypothetical answer
        # (More likely to match actual documents)
        results = self.vector_db.search(
            query=hypothetical_doc,
            top_k=top_k
        )

        return results

# Why this works:
# Query: "How do I reset password?"
# Hypothetical: "To reset your password, navigate to Settings, click Security, then Reset Password..."
# This matches documentation better than the short query

Retrieval Strategy 3: Metadata Filtering

class MetadataFilteredRetriever:
    """
    Filter by metadata before or after vector search.
    """

    def __init__(self, vector_db):
        self.vector_db = vector_db

    def retrieve(self, query: str, filters: dict = None, top_k: int = 5):
        # Pre-filter by metadata
        if filters:
            candidate_docs = self.filter_by_metadata(filters)
        else:
            candidate_docs = None

        # Vector search within filtered subset
        results = self.vector_db.search(
            query=query,
            filter=candidate_docs,
            top_k=top_k
        )

        return results

    def filter_by_metadata(self, filters: dict):
        """
        Example filters:
        {
            'category': 'api-documentation',
            'version': '2.0',
            'language': 'python',
            'last_updated': {'$gte': '2025-01-01'}
        }
        """
        return self.vector_db.filter(filters)

# Usage
retriever.retrieve(
    query="How do I authenticate?",
    filters={
        'category': 'authentication',
        'version': 'latest'
    }
)

Why metadata matters:

  • Scope search to relevant subset
  • Filter by date (get current docs)
  • Filter by type (code vs prose)
  • Filter by permission level (user access)

Enriching Chunks with Metadata

def create_enriched_chunks(doc: dict) -> list[dict]:
    """
    Add metadata that improves filtering.
    Assumes chunk_document returns dicts with 'text' and 'section' keys.
    """
    chunks = chunk_document(doc['content'])

    enriched = []
    for chunk in chunks:
        text = chunk['text']
        enriched.append({
            'content': text,
            'metadata': {
                'doc_id': doc['id'],
                'title': doc['title'],
                'section': chunk.get('section'),
                'category': doc['category'],
                'tags': doc['tags'],
                'version': doc['version'],
                'last_updated': doc['last_updated'],
                'author': doc['author'],
                'content_type': detect_content_type(text),
                'language': detect_language(text),
                'word_count': len(text.split()),
                'has_code': contains_code_block(text)
            }
        })

    return enriched

Retrieval Strategy 4: Reranking

Two-Stage Retrieval

class RerankedRetriever:
    """
    Stage 1: Fast, broad retrieval
    Stage 2: Expensive, accurate reranking
    """

    def __init__(self, vector_db, reranker):
        self.vector_db = vector_db
        self.reranker = reranker  # Cross-encoder or LLM

    def retrieve(self, query: str, top_k: int = 5):
        # Stage 1: Retrieve more candidates (cheap)
        candidates = self.vector_db.search(
            query=query,
            top_k=top_k * 5  # Over-retrieve
        )

        # Stage 2: Rerank candidates (expensive)
        reranked = self.reranker.rank(
            query=query,
            documents=candidates
        )

        return reranked[:top_k]

# Why this works:
# - Vector search: Fast but approximate
# - Reranker: Slow but precise
# - Over-retrieve + rerank = best of both

Cross-Encoder Reranking

from sentence_transformers import CrossEncoder

class CrossEncoderReranker:
    """
    Score query-document pairs jointly.
    More accurate than separate embeddings.
    """

    def __init__(self, model_name: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2'):
        self.model = CrossEncoder(model_name)

    def rank(self, query: str, documents: list[dict]) -> list[dict]:
        # Create query-document pairs
        pairs = [[query, doc['content']] for doc in documents]

        # Score all pairs
        scores = self.model.predict(pairs)

        # Sort by score
        for doc, score in zip(documents, scores):
            doc['rerank_score'] = score

        ranked = sorted(documents, key=lambda d: d['rerank_score'], reverse=True)
        return ranked
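
For reference, a hypothetical wiring of this reranker into the two-stage retriever defined above, with vector_db as elsewhere in this article:

# Over-retrieve with vector search, then rerank with the cross-encoder
reranker = CrossEncoderReranker()
retriever = RerankedRetriever(vector_db, reranker)
top_docs = retriever.retrieve("How do I rotate an API key?", top_k=5)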

LLM-Based Reranking

import json

class LLMReranker:
    """
    Use LLM to judge relevance.
    Most expensive but most accurate.
    """

    def rank(self, query: str, documents: list[dict], top_k: int = 5):
        ranking_prompt = f"""
        Query: {query}

        Documents:
        {self.format_docs(documents)}

        Task: Rank these documents by relevance to the query.
        Output only document IDs in order, most relevant first.
        Format: [id1, id2, id3, ...]
        """

        ranked_ids = llm.generate(ranking_prompt)
        ranked_ids = json.loads(ranked_ids)

        # Reorder documents
        doc_map = {doc['id']: doc for doc in documents}
        return [doc_map[doc_id] for doc_id in ranked_ids[:top_k]]

Retrieval Strategy 5: Contextual Retrieval

Using Conversation History

class ContextualRetriever:
    """
    Use conversation context to improve retrieval.
    """

    def __init__(self, vector_db):
        self.vector_db = vector_db
        self.conversation_history = []

    def retrieve(self, query: str, top_k: int = 5):
        # Rewrite query with context
        contextual_query = self.rewrite_with_context(
            query,
            self.conversation_history
        )

        # Search with contextual query
        results = self.vector_db.search(
            query=contextual_query,
            top_k=top_k
        )

        # Update history
        self.conversation_history.append({
            'query': query,
            'results': [r['id'] for r in results]
        })

        return results

    def rewrite_with_context(self, query: str, history: list):
        if not history:
            return query

        prompt = f"""
        Conversation history:
        {self.format_history(history[-3:])}  # Last 3 turns

        Current query: {query}

        Rewrite the current query to be self-contained,
        incorporating relevant context from history.

        Rewritten query:
        """

        return llm.generate(prompt)

# Example:
# User: "How do I create an API key?"
# Bot: [Explains API key creation]
# User: "Where do I use it?"  # Ambiguous
# Rewritten: "Where do I use the API key in my requests?"

Debugging Retrieval Failures

Step 1: Measure Retrieval Quality

from statistics import mean

class RetrievalEvaluator:
    """
    Quantify retrieval performance.
    """

    def __init__(self, retriever):
        self.retriever = retriever

    def evaluate(self, test_cases: list[dict]) -> dict:
        metrics = {
            'precision': [],
            'recall': [],
            'mrr': []  # Mean Reciprocal Rank
        }

        for case in test_cases:
            results = self.retriever.retrieve(case['query'], top_k=10)
            relevant_ids = set(case['relevant_doc_ids'])
            retrieved_ids = [r['id'] for r in results]

            # Precision: What fraction of retrieved docs are relevant?
            relevant_retrieved = set(retrieved_ids) & relevant_ids
            precision = len(relevant_retrieved) / len(retrieved_ids)
            metrics['precision'].append(precision)

            # Recall: What fraction of relevant docs were retrieved?
            recall = len(relevant_retrieved) / len(relevant_ids)
            metrics['recall'].append(recall)

            # MRR: Position of first relevant doc
            for rank, doc_id in enumerate(retrieved_ids, 1):
                if doc_id in relevant_ids:
                    metrics['mrr'].append(1 / rank)
                    break
            else:
                metrics['mrr'].append(0)

        return {
            'precision': mean(metrics['precision']),
            'recall': mean(metrics['recall']),
            'mrr': mean(metrics['mrr'])
        }
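
The evaluator expects a small labeled test set. A sketch of the expected shape; the queries and doc IDs are illustrative, and retriever is any of the retrievers defined above:

# Each case maps a query to the doc IDs a human judged relevant
test_cases = [
    {'query': 'How do I reset my password?', 'relevant_doc_ids': ['doc_017', 'doc_042']},
    {'query': 'api rate limits', 'relevant_doc_ids': ['doc_008']},
]

evaluator = RetrievalEvaluator(retriever)
print(evaluator.evaluate(test_cases))
# -> {'precision': ..., 'recall': ..., 'mrr': ...}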

Step 2: Identify Failure Patterns

def analyze_failures(test_cases: list[dict], retriever):
    """
    Categorize retrieval failures.
    """
    failure_types = {
        'missing_relevant': [],  # Relevant docs not retrieved
        'low_ranking': [],       # Relevant docs ranked too low
        'irrelevant_noise': [],  # Too many irrelevant docs
        'query_mismatch': []     # Query-document vocabulary gap
    }

    for case in test_cases:
        results = retriever.retrieve(case['query'], top_k=10)
        relevant_ids = set(case['relevant_doc_ids'])
        retrieved_ids = [r['id'] for r in results]

        # Missing relevant docs?
        missing = relevant_ids - set(retrieved_ids)
        if missing:
            failure_types['missing_relevant'].append({
                'query': case['query'],
                'missing_docs': missing
            })

        # Relevant docs ranked low?
        for doc_id in relevant_ids:
            if doc_id in retrieved_ids:
                rank = retrieved_ids.index(doc_id) + 1
                if rank > 5:  # Should be in top 5
                    failure_types['low_ranking'].append({
                        'query': case['query'],
                        'doc_id': doc_id,
                        'rank': rank
                    })

        # Too many irrelevant docs in top results?
        top_5 = set(retrieved_ids[:5])
        irrelevant_in_top = top_5 - relevant_ids
        if len(irrelevant_in_top) > 3:
            failure_types['irrelevant_noise'].append({
                'query': case['query'],
                'irrelevant_count': len(irrelevant_in_top)
            })

    return failure_types
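
A quick way to turn the output into a summary report; a sketch, with test_cases and retriever as above:

# Count failures per category to see where to focus
failures = analyze_failures(test_cases, retriever)
for failure_type, cases in failures.items():
    print(f"{failure_type}: {len(cases)} queries affected")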

Step 3: Test Retrieval Strategies

def compare_strategies(test_cases: list[dict]):
    """
    Compare different retrieval approaches.
    """
    strategies = {
        'baseline': VectorRetriever(vector_db),
        'hybrid': HybridRetriever(vector_db, keyword_index),
        'reranked': RerankedRetriever(vector_db, reranker),
        'contextual': ContextualRetriever(vector_db)
    }

    results = {}
    for name, strategy in strategies.items():
        evaluator = RetrievalEvaluator(strategy)
        metrics = evaluator.evaluate(test_cases)
        results[name] = metrics

    # Compare
    print("Retrieval Strategy Comparison:")
    for name, metrics in results.items():
        print(f"{name}:")
        print(f"  Precision: {metrics['precision']:.3f}")
        print(f"  Recall: {metrics['recall']:.3f}")
        print(f"  MRR: {metrics['mrr']:.3f}")

    return results

Production Retrieval Pipeline

End-to-End Implementation

import time
from datetime import datetime

class ProductionRetriever:
    """
    Production-grade retrieval with fallbacks and monitoring.
    """

    def __init__(self, config: dict):
        self.vector_db = VectorDB(config['vector_db'])
        self.keyword_index = KeywordIndex(config['keyword_index'])
        self.reranker = CrossEncoderReranker(config['reranker_model'])
        self.metadata_filters = config.get('metadata_filters', {})

    def retrieve(
        self,
        query: str,
        top_k: int = 5,
        filters: dict = None,
        context: list = None
    ) -> list[dict]:
        """
        Multi-stage retrieval with monitoring.
        """
        start_time = time.time()

        try:
            # Stage 1: Query preprocessing
            processed_query = self.preprocess_query(query, context)

            # Stage 2: Hybrid retrieval (over-retrieve)
            candidates = self.hybrid_search(
                processed_query,
                top_k=top_k * 5,
                filters=filters
            )

            if not candidates:
                # Fallback: Relax filters
                candidates = self.hybrid_search(
                    processed_query,
                    top_k=top_k * 5,
                    filters=None
                )

            # Stage 3: Rerank
            reranked = self.reranker.rank(processed_query, candidates)

            # Stage 4: Post-processing
            results = self.postprocess_results(reranked[:top_k])

            # Log metrics
            self.log_retrieval(
                query=query,
                results=results,
                latency_ms=(time.time() - start_time) * 1000
            )

            return results

        except Exception as e:
            self.log_error(query, e)
            return self.fallback_retrieval(query, top_k)

    def hybrid_search(self, query: str, top_k: int, filters: dict):
        # Vector search
        dense = self.vector_db.search(query, top_k, filters)

        # Keyword search
        sparse = self.keyword_index.search(query, top_k, filters)

        # Merge (reciprocal-rank fusion, as in HybridRetriever.merge_results above)
        return self.merge_results(dense, sparse)

    def fallback_retrieval(self, query: str, top_k: int):
        """
        Simple fallback when primary retrieval fails.
        """
        return self.keyword_index.search(query, top_k)

    def log_retrieval(self, query: str, results: list, latency_ms: float):
        """
        Track retrieval metrics for monitoring.
        """
        log_event({
            'type': 'retrieval',
            'query': query,
            'num_results': len(results),
            'latency_ms': latency_ms,
            'top_score': results[0]['score'] if results else 0,
            'timestamp': datetime.now()
        })
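
A hypothetical instantiation: the config keys mirror what __init__ reads above, and the connection values are placeholders that depend on your stack:

# Placeholder config; real connection settings depend on your infrastructure
config = {
    'vector_db': {'url': 'http://localhost:6333', 'collection': 'docs'},
    'keyword_index': {'url': 'http://localhost:9200', 'index': 'docs'},
    'reranker_model': 'cross-encoder/ms-marco-MiniLM-L-6-v2',
}

retriever = ProductionRetriever(config)
results = retriever.retrieve(
    query="api broken",
    top_k=5,
    filters={'category': 'api-documentation'}
)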

Conclusion

Retrieval is where RAG succeeds or fails.

Key principles:

  1. Embeddings alone are insufficient: Combine dense and sparse search
  2. One-stage retrieval is brittle: Use multi-stage pipelines
  3. Context matters: Rewrite queries, use metadata, incorporate history
  4. Measure continuously: Track precision, recall, and ranking quality
  5. Have fallbacks: Retrieval will fail; handle it gracefully

Most engineering effort in RAG should focus on retrieval, not generation. The model can only work with what retrieval gives it.

Get retrieval right, and RAG works. Get retrieval wrong, and no amount of prompt engineering will save it.

Continue learning

Next in this path

Evaluating RAG Quality: Precision, Recall, and Faithfulness

Without evaluation, RAG systems cannot improve reliably. This article introduces practical metrics and evaluation strategies for measuring retrieval accuracy, answer grounding, and regression over time.