Evaluating RAG Quality: Precision, Recall, and Faithfulness

Without evaluation, RAG systems cannot improve reliably. This article introduces practical metrics and evaluation strategies for measuring retrieval accuracy and answer grounding, and for catching regressions over time.

level: advanced
topics: rag
tags: rag, evaluation, metrics, llmops, production

Why “It Looks Good” Is Not Enough

After building a RAG system, engineers often test it like this:

# Manual testing
query = "How do I reset my password?"
answer = rag_system.query(query)
print(answer)

# Developer: "Looks good to me!"

This is not evaluation. This is a demo.

Production RAG systems need:

  1. Quantitative metrics: Measure performance objectively
  2. Regression detection: Know when changes break things
  3. Component-level insight: Identify where failures occur
  4. Continuous monitoring: Track quality over time

This article covers:

  • Metrics for retrieval quality
  • Metrics for generation quality
  • Building evaluation pipelines
  • Monitoring RAG in production

Evaluation Framework Overview

RAG Has Three Failure Modes

# Mode 1: Retrieval failure
query = "How do I authenticate?"
retrieved_docs = []  # Nothing retrieved
# Model cannot answer without context

# Mode 2: Ranking failure
retrieved_docs = [irrelevant_1, irrelevant_2, relevant_doc, ...]
# Relevant doc ranked too low, not included in context

# Mode 3: Generation failure
retrieved_docs = [relevant_doc_1, relevant_doc_2]
# Model generates answer not grounded in retrieved docs

Evaluation must cover all three modes.
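
As a rough map from failure mode to the metric families that catch it (each is covered in the parts below):

# Mode 1 (retrieval failure)  -> recall, context recall
# Mode 2 (ranking failure)    -> MRR, NDCG, precision
# Mode 3 (generation failure) -> faithfulness, answer relevance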


Part 1: Retrieval Metrics

Metric 1: Precision

Definition: What fraction of retrieved documents are relevant?

def precision_at_k(retrieved: list[str], relevant: set[str]) -> float:
    """
    Precision@K: Fraction of retrieved docs that are relevant.
    """
    retrieved_set = set(retrieved)
    relevant_retrieved = retrieved_set & relevant

    return len(relevant_retrieved) / len(retrieved) if retrieved else 0

# Example
retrieved = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
relevant = {'doc2', 'doc5', 'doc7'}

precision = precision_at_k(retrieved, relevant)
# 2 relevant out of 5 retrieved = 0.4

What it tells you:

  • High precision: Few irrelevant results
  • Low precision: Too much noise

When to optimize:

  • Limited context window
  • High cost per retrieved document

Metric 2: Recall

Definition: What fraction of relevant documents were retrieved?

def recall_at_k(retrieved: list[str], relevant: set[str]) -> float:
    """
    Recall@K: Fraction of relevant docs that were retrieved.
    """
    retrieved_set = set(retrieved)
    relevant_retrieved = retrieved_set & relevant

    return len(relevant_retrieved) / len(relevant) if relevant else 0

# Example
retrieved = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
relevant = {'doc2', 'doc5', 'doc7'}

recall = recall_at_k(retrieved, relevant)
# 2 relevant out of 3 total relevant = 0.67

What it tells you:

  • High recall: Found most relevant documents
  • Low recall: Missing important information

When to optimize:

  • Completeness critical
  • Can tolerate some noise
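
Precision and recall respond differently as you change top-k: recall is non-decreasing in k, while precision depends on how much noise each extra document brings. A quick sweep with the two functions above, reusing the example lists, makes this concrete:

# Sweep k to see how precision and recall move for the example above
for k in (1, 3, 5):
    top_k = retrieved[:k]
    print(
        f"k={k}: "
        f"precision={precision_at_k(top_k, relevant):.2f}, "
        f"recall={recall_at_k(top_k, relevant):.2f}"
    )
# k=1: precision=0.00, recall=0.00
# k=3: precision=0.33, recall=0.33
# k=5: precision=0.40, recall=0.67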

Metric 3: Mean Reciprocal Rank (MRR)

Definition: The average, across queries, of the reciprocal rank of the first relevant result.

def mean_reciprocal_rank(results: list[list[str]], relevant_sets: list[set[str]]) -> float:
    """
    MRR: How high is the first relevant result ranked?
    """
    reciprocal_ranks = []

    for retrieved, relevant in zip(results, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                reciprocal_ranks.append(1 / rank)
                break
        else:
            reciprocal_ranks.append(0)  # No relevant doc found

    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example
results = [
    ['doc1', 'doc2', 'doc3'],  # First relevant at position 2
    ['doc4', 'doc5', 'doc6'],  # First relevant at position 1
    ['doc7', 'doc8', 'doc9']   # No relevant doc
]
relevant_sets = [
    {'doc2', 'doc5'},
    {'doc4'},
    {'doc10'}
]

mrr = mean_reciprocal_rank(results, relevant_sets)
# (1/2 + 1/1 + 0) / 3 = 0.5

What it tells you:

  • High MRR: Relevant docs ranked high
  • Low MRR: Relevant docs ranked low or missing

When to optimize:

  • User experience depends on top results
  • Top-k context window limit

Metric 4: Normalized Discounted Cumulative Gain (NDCG)

Definition: Measures ranking quality with graded relevance.

import numpy as np

def dcg_at_k(relevance_scores: list[float], k: int) -> float:
    """
    DCG: Weighted sum of relevance scores.
    Higher-ranked docs have more weight.
    """
    relevance_scores = np.array(relevance_scores[:k])
    if relevance_scores.size == 0:
        return 0.0

    # Discount by log of position
    discounts = np.log2(np.arange(2, relevance_scores.size + 2))
    return np.sum(relevance_scores / discounts)

def ndcg_at_k(relevance_scores: list[float], k: int) -> float:
    """
    NDCG: DCG normalized by ideal DCG.
    Score between 0 and 1.
    """
    dcg = dcg_at_k(relevance_scores, k)

    # Ideal: Sort by relevance (best possible ranking)
    ideal_scores = sorted(relevance_scores, reverse=True)
    idcg = dcg_at_k(ideal_scores, k)

    return dcg / idcg if idcg > 0 else 0.0

# Example
# Query: "API authentication"
# Retrieved docs with relevance scores (3=perfect, 2=good, 1=partial, 0=irrelevant)
retrieved_relevance = [1, 3, 0, 2, 0]  # Retrieved ranking
# The most relevant doc (score 3) sits at position 2 instead of position 1

ndcg = ndcg_at_k(retrieved_relevance, k=5)
# Lower than 1.0 because perfect doc ranked 2nd instead of 1st
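
Working the example through by hand (values rounded) shows where the score comes from:

# DCG  = 1/log2(2) + 3/log2(3) + 0/log2(4) + 2/log2(5) + 0/log2(6) ≈ 3.75
# IDCG = 3/log2(2) + 2/log2(3) + 1/log2(4)                         ≈ 4.76
# NDCG ≈ 3.75 / 4.76 ≈ 0.79
print(f"NDCG@5: {ndcg:.2f}")  # ≈ 0.79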

What it tells you:

  • NDCG=1.0: Perfect ranking
  • Lower NDCG: Relevant docs ranked suboptimally

When to optimize:

  • Multiple levels of relevance
  • Ranking order matters

Part 2: Generation Metrics

Metric 1: Faithfulness (Answer Grounding)

Definition: Is the generated answer supported by retrieved documents?

class FaithfulnessEvaluator:
    """
    Measure whether the answer is grounded in the retrieved context.
    Assumes a global `llm` client exposing generate(prompt) -> str
    (a minimal shim is sketched after the example below).
    """

    def evaluate(self, answer: str, retrieved_docs: list[str]) -> dict:
        # Extract claims from answer
        claims = self.extract_claims(answer)

        # Check each claim against retrieved docs
        results = {
            'total_claims': len(claims),
            'supported_claims': 0,
            'unsupported_claims': []
        }

        for claim in claims:
            if self.is_supported(claim, retrieved_docs):
                results['supported_claims'] += 1
            else:
                results['unsupported_claims'].append(claim)

        results['faithfulness_score'] = (
            results['supported_claims'] / results['total_claims']
            if results['total_claims'] > 0 else 0
        )

        return results

    def extract_claims(self, answer: str) -> list[str]:
        """
        Use LLM to break answer into atomic claims.
        """
        prompt = f"""
        Break this answer into individual factual claims.
        Each claim should be a single, verifiable statement.

        Answer: {answer}

        Claims (one per line):
        """
        claims_text = llm.generate(prompt)
        return [c.strip() for c in claims_text.split('\n') if c.strip()]

    def is_supported(self, claim: str, docs: list[str]) -> bool:
        """
        Check if claim is supported by documents.
        """
        context = '\n\n'.join(docs)

        verification_prompt = f"""
        Context:
        {context}

        Claim: {claim}

        Is this claim directly supported by the context?
        Answer only: YES or NO
        """

        result = llm.generate(verification_prompt).strip().upper()
        return result == 'YES'

# Example usage
answer = "The API rate limit is 1000 requests per hour, and it costs $0.01 per request."
retrieved_docs = ["API rate limit: 1000 req/hour"]

evaluator = FaithfulnessEvaluator()
faithfulness = evaluator.evaluate(answer, retrieved_docs)

# Result:
# {
#   'total_claims': 2,
#   'supported_claims': 1,  # Rate limit claim supported
#   'unsupported_claims': ['costs $0.01 per request'],  # Hallucinated
#   'faithfulness_score': 0.5
# }

What it tells you:

  • Low faithfulness: Model hallucinating
  • High faithfulness: Answers grounded in context
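
The evaluators in this article call a generic llm.generate(prompt) helper. One way to back it with a real model is a thin shim over the OpenAI Python SDK; the model name below is only an example, and any chat-completion client would work:

from openai import OpenAI

class _LLM:
    """Minimal stand-in for the `llm` object used throughout this article."""

    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # keep judgments as deterministic as possible
        )
        return response.choices[0].message.content

llm = _LLM()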

Metric 2: Relevance (Answer Completeness)

Definition: Does the answer address the question?

class RelevanceEvaluator:
    """
    Measure if answer actually addresses the question.
    """

    def evaluate(self, query: str, answer: str) -> float:
        prompt = f"""
        Query: {query}
        Answer: {answer}

        Does this answer fully address the query?
        Rate from 1-5:
        5 = Completely addresses query
        4 = Addresses query with minor gaps
        3 = Partially addresses query
        2 = Tangentially related
        1 = Does not address query

        Rating (1-5):
        """

        rating = llm.generate(prompt).strip()
        try:
            return int(rating) / 5.0  # Normalize to 0-1
        except ValueError:
            return 0.0

# Example
query = "How do I authenticate with the API?"
answer = "The API uses OAuth 2.0. Contact support for credentials."

relevance = RelevanceEvaluator().evaluate(query, answer)
# Might score 3/5: Mentions auth method but lacks implementation details

Metric 3: Context Precision

Definition: What fraction of retrieved context is actually useful for answering?

def context_precision(query: str, answer: str, retrieved_docs: list[str]) -> float:
    """
    How many retrieved docs were actually needed?
    """
    useful_docs = 0

    for doc in retrieved_docs:
        # Check if this doc contributed to the answer
        prompt = f"""
        Query: {query}
        Answer: {answer}
        Document: {doc}

        Was this document useful for generating the answer?
        Answer: YES or NO
        """

        result = llm.generate(prompt).strip().upper()
        if result == 'YES':
            useful_docs += 1

    return useful_docs / len(retrieved_docs) if retrieved_docs else 0

# High context precision = retrieval is efficient
# Low context precision = retrieving too much noise

Metric 4: Context Recall

Definition: Did retrieval capture all information needed to answer?

def context_recall(query: str, ground_truth: str, retrieved_docs: list[str]) -> float:
    """
    Can the ground truth answer be generated from retrieved docs?
    """
    context = '\n\n'.join(retrieved_docs)

    prompt = f"""
    Query: {query}
    Ground truth answer: {ground_truth}

    Context:
    {context}

    What fraction of the ground truth answer can be supported by this context?
    Answer with a number between 0.0 (none) and 1.0 (all).

    Fraction:
    """

    result = llm.generate(prompt).strip()
    try:
        return float(result)
    except ValueError:
        return 0.0

# High context recall = retrieval captured necessary info
# Low context recall = missing key documents

Part 3: Building an Evaluation Pipeline

Creating Test Cases

from pydantic import BaseModel

class RAGTestCase(BaseModel):
    """
    Single evaluation test case.
    """
    query: str
    ground_truth_answer: str
    relevant_doc_ids: set[str]
    metadata: dict = {}

def create_test_suite() -> list[RAGTestCase]:
    """
    Curate evaluation test cases.
    """
    return [
        RAGTestCase(
            query="How do I reset my password?",
            ground_truth_answer="Navigate to Settings > Security > Reset Password...",
            relevant_doc_ids={'doc_123', 'doc_456'}
        ),
        RAGTestCase(
            query="What is the API rate limit?",
            ground_truth_answer="1000 requests per hour for free tier",
            relevant_doc_ids={'doc_789'}
        ),
        # Add 50-100 diverse test cases
    ]

Evaluation Runner

from statistics import mean

class RAGEvaluator:
    """
    Run comprehensive evaluation on RAG system.
    """

    def __init__(self, rag_system, test_cases: list[RAGTestCase]):
        self.rag_system = rag_system
        self.test_cases = test_cases

    def evaluate_all(self) -> dict:
        """
        Run full evaluation suite.
        """
        results = {
            'retrieval_metrics': self.evaluate_retrieval(),
            'generation_metrics': self.evaluate_generation(),
            'end_to_end_metrics': self.evaluate_end_to_end()
        }

        return results

    def evaluate_retrieval(self) -> dict:
        """
        Evaluate retrieval quality.
        """
        precisions = []
        recalls = []
        mrrs = []

        for case in self.test_cases:
            retrieved = self.rag_system.retrieve(case.query, top_k=5)
            retrieved_ids = [r['id'] for r in retrieved]

            # Calculate metrics
            precision = precision_at_k(retrieved_ids, case.relevant_doc_ids)
            recall = recall_at_k(retrieved_ids, case.relevant_doc_ids)

            precisions.append(precision)
            recalls.append(recall)

            # MRR
            for rank, doc_id in enumerate(retrieved_ids, start=1):
                if doc_id in case.relevant_doc_ids:
                    mrrs.append(1 / rank)
                    break
            else:
                mrrs.append(0)

        return {
            'precision@5': mean(precisions),
            'recall@5': mean(recalls),
            'mrr': mean(mrrs)
        }

    def evaluate_generation(self) -> dict:
        """
        Evaluate generation quality.
        """
        faithfulness_scores = []
        relevance_scores = []

        for case in self.test_cases:
            # Get RAG system output
            result = self.rag_system.query(case.query)
            answer = result['answer']
            retrieved_docs = [r['content'] for r in result['retrieved_docs']]

            # Faithfulness
            faith_eval = FaithfulnessEvaluator()
            faith_result = faith_eval.evaluate(answer, retrieved_docs)
            faithfulness_scores.append(faith_result['faithfulness_score'])

            # Relevance
            rel_eval = RelevanceEvaluator()
            relevance_scores.append(rel_eval.evaluate(case.query, answer))

        return {
            'faithfulness': mean(faithfulness_scores),
            'relevance': mean(relevance_scores)
        }

    def evaluate_end_to_end(self) -> dict:
        """
        End-to-end evaluation comparing to ground truth.
        """
        exact_matches = 0
        semantic_similarities = []

        for case in self.test_cases:
            result = self.rag_system.query(case.query)
            answer = result['answer']

            # Exact match (rare but possible)
            if answer.strip().lower() == case.ground_truth_answer.strip().lower():
                exact_matches += 1

            # Semantic similarity
            similarity = compute_semantic_similarity(
                answer,
                case.ground_truth_answer
            )
            semantic_similarities.append(similarity)

        return {
            'exact_match': exact_matches / len(self.test_cases),
            'semantic_similarity': mean(semantic_similarities)
        }

# Run evaluation
evaluator = RAGEvaluator(rag_system, test_cases)
results = evaluator.evaluate_all()

print(f"Retrieval Precision: {results['retrieval_metrics']['precision@5']:.3f}")
print(f"Faithfulness: {results['generation_metrics']['faithfulness']:.3f}")

Part 4: Automated Evaluation with LLMs

Using LLM-as-Judge

import json

class LLMJudge:
    """
    Use LLM to evaluate RAG quality.
    Faster and cheaper than human evaluation.
    """

    def evaluate_answer_quality(
        self,
        query: str,
        answer: str,
        retrieved_docs: list[str],
        ground_truth: str = None
    ) -> dict:
        """
        Multi-aspect LLM-based evaluation.
        """
        context = '\n\n'.join(retrieved_docs)

        eval_prompt = f"""
        Evaluate this RAG system output.

        Query: {query}
        Retrieved Context:
        {context}

        Generated Answer: {answer}
        {f'Ground Truth: {ground_truth}' if ground_truth else ''}

        Evaluate on these dimensions (score 1-5 for each):

        1. Faithfulness: Is the answer supported by the retrieved context?
        2. Relevance: Does the answer address the query?
        3. Completeness: Does the answer fully address the query?
        4. Conciseness: Is the answer appropriately concise?

        Provide scores and brief justification for each.

        Output format (JSON):
        {{
            "faithfulness": {{"score": 1-5, "justification": "..."}},
            "relevance": {{"score": 1-5, "justification": "..."}},
            "completeness": {{"score": 1-5, "justification": "..."}},
            "conciseness": {{"score": 1-5, "justification": "..."}}
        }}
        """

        result = llm.generate(eval_prompt)
        return json.loads(result)

# Example usage
judge = LLMJudge()
scores = judge.evaluate_answer_quality(
    query="How do I authenticate?",
    answer="Use OAuth 2.0 with client credentials.",
    retrieved_docs=["API uses OAuth 2.0..."],
    ground_truth="Configure OAuth 2.0 with client ID and secret..."
)

print(f"Faithfulness: {scores['faithfulness']['score']}/5")
print(f"Relevance: {scores['relevance']['score']}/5")

Pairwise Comparison

def compare_rag_versions(query: str, answer_a: str, answer_b: str) -> str:
    """
    Compare two RAG system versions.
    Often more reliable than absolute scoring.
    """
    prompt = f"""
    Query: {query}

    Answer A: {answer_a}
    Answer B: {answer_b}

    Which answer is better?
    Consider:
    - Accuracy
    - Completeness
    - Clarity

    Output: "A", "B", or "TIE"
    """

    return llm.generate(prompt).strip()

# A/B testing RAG improvements
wins = {'A': 0, 'B': 0, 'TIE': 0}

for case in test_cases:
    answer_a = rag_v1.query(case.query)['answer']
    answer_b = rag_v2.query(case.query)['answer']

    winner = compare_rag_versions(case.query, answer_a, answer_b)
    wins[winner if winner in wins else 'TIE'] += 1  # guard against unexpected judge output

print(f"Version A: {wins['A']} wins")
print(f"Version B: {wins['B']} wins")
print(f"Ties: {wins['TIE']}")

Part 5: Production Monitoring

Real-Time Metrics

from datetime import datetime

class RAGMonitor:
    """
    Track RAG performance in production.
    """

    def __init__(self):
        self.metrics_buffer = []

    def log_query(
        self,
        query: str,
        answer: str,
        retrieved_docs: list[dict],
        latency_ms: float,
        user_feedback: str = None
    ):
        """
        Log every RAG query for monitoring.
        """
        metric = {
            'timestamp': datetime.now(),
            'query_length': len(query.split()),
            'answer_length': len(answer.split()),
            'num_docs_retrieved': len(retrieved_docs),
            'latency_ms': latency_ms,
            'user_feedback': user_feedback,  # thumbs up/down
            'retrieval_scores': [d['score'] for d in retrieved_docs]
        }

        self.metrics_buffer.append(metric)

        # Alert on anomalies
        self.check_anomalies(metric)

    def check_anomalies(self, metric: dict):
        """
        Detect potential issues.
        """
        # High latency
        if metric['latency_ms'] > 5000:
            alert('High latency', metric)

        # No documents retrieved
        if metric['num_docs_retrieved'] == 0:
            alert('Retrieval failure', metric)

        # Low retrieval scores
        if metric['retrieval_scores'] and max(metric['retrieval_scores']) < 0.5:
            alert('Low relevance scores', metric)

    def generate_report(self, time_window: str = '24h') -> dict:
        """
        Aggregate metrics over a time window.
        """
        recent_metrics = filter_by_time(self.metrics_buffer, time_window)
        if not recent_metrics:
            return {'total_queries': 0}

        return {
            'total_queries': len(recent_metrics),
            'avg_latency_ms': mean([m['latency_ms'] for m in recent_metrics]),
            'avg_docs_retrieved': mean([m['num_docs_retrieved'] for m in recent_metrics]),
            'positive_feedback_rate': sum(
                1 for m in recent_metrics
                if m['user_feedback'] == 'positive'
            ) / len(recent_metrics),
            'zero_results_rate': sum(
                1 for m in recent_metrics
                if m['num_docs_retrieved'] == 0
            ) / len(recent_metrics)
        }
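
The monitor above assumes alert and filter_by_time helpers. Minimal stand-ins might look like this; a real deployment would route alerts to a pager or chat channel and window over persisted metrics:

from datetime import datetime, timedelta

def alert(message: str, metric: dict):
    """Stand-in alert hook: log to stdout."""
    print(f"[ALERT] {message}: {metric}")

def filter_by_time(metrics: list[dict], time_window: str) -> list[dict]:
    """Keep metrics newer than a window like '24h' or '7d'."""
    amount, unit = int(time_window[:-1]), time_window[-1]
    delta = timedelta(hours=amount) if unit == 'h' else timedelta(days=amount)
    cutoff = datetime.now() - delta
    return [m for m in metrics if m['timestamp'] >= cutoff]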

Regression Detection

class RegressionDetector:
    """
    Detect when RAG quality degrades.
    """

    def __init__(self, test_cases: list[RAGTestCase]):
        self.test_cases = test_cases
        self.baseline_metrics = None

    def establish_baseline(self, rag_system):
        """
        Run evaluation and save as baseline.
        """
        evaluator = RAGEvaluator(rag_system, self.test_cases)
        self.baseline_metrics = evaluator.evaluate_all()

    def detect_regression(self, rag_system, threshold: float = 0.05):
        """
        Check if current performance dropped significantly.
        """
        evaluator = RAGEvaluator(rag_system, self.test_cases)
        current_metrics = evaluator.evaluate_all()

        regressions = []

        # Compare retrieval metrics
        for metric in ['precision@5', 'recall@5', 'mrr']:
            baseline = self.baseline_metrics['retrieval_metrics'][metric]
            current = current_metrics['retrieval_metrics'][metric]
            delta = baseline - current

            if delta > threshold:
                regressions.append({
                    'metric': metric,
                    'baseline': baseline,
                    'current': current,
                    'delta': delta
                })

        # Compare generation metrics
        for metric in ['faithfulness', 'relevance']:
            baseline = self.baseline_metrics['generation_metrics'][metric]
            current = current_metrics['generation_metrics'][metric]
            delta = baseline - current

            if delta > threshold:
                regressions.append({
                    'metric': metric,
                    'baseline': baseline,
                    'current': current,
                    'delta': delta
                })

        return regressions

# Usage in CI/CD
detector = RegressionDetector(test_cases)
detector.establish_baseline(rag_system_v1)

# Before deploying v2
regressions = detector.detect_regression(rag_system_v2)

if regressions:
    print("REGRESSION DETECTED:")
    for reg in regressions:
        print(f"{reg['metric']}: {reg['baseline']:.3f}{reg['current']:.3f}")
    raise Exception("Cannot deploy: performance regression")

Conclusion

Evaluation is not optional for production RAG.

Key practices:

  1. Measure retrieval separately: Precision, recall, MRR, NDCG
  2. Measure generation quality: Faithfulness, relevance, completeness
  3. Automate evaluation: Use LLM-as-judge for scalability
  4. Monitor continuously: Track metrics in production
  5. Detect regressions: Prevent quality degradation

Build evaluation infrastructure early. Without it, you are optimizing blindly.

The goal is not perfect scores. The goal is measurable improvement over time.
