Evaluating RAG Quality: Precision, Recall, and Faithfulness
Without evaluation, RAG systems cannot improve reliably. This article introduces practical metrics and evaluation strategies for measuring retrieval accuracy, answer grounding, and regressions over time.
Why “It Looks Good” Is Not Enough
After building a RAG system, engineers often test it like this:
# Manual testing
query = "How do I reset my password?"
answer = rag_system.query(query)
print(answer)
# Developer: "Looks good to me!"
This is not evaluation. This is a demo.
Production RAG systems need:
- Quantitative metrics: Measure performance objectively
- Regression detection: Know when changes break things
- Component-level insight: Identify where failures occur
- Continuous monitoring: Track quality over time
This article covers:
- Metrics for retrieval quality
- Metrics for generation quality
- Building evaluation pipelines
- Monitoring RAG in production
Evaluation Framework Overview
RAG Has Three Failure Modes
# Mode 1: Retrieval failure
query = "How do I authenticate?"
retrieved_docs = [] # Nothing retrieved
# Model cannot answer without context
# Mode 2: Ranking failure
retrieved_docs = [irrelevant_1, irrelevant_2, relevant_doc, ...]
# Relevant doc ranked too low, not included in context
# Mode 3: Generation failure
retrieved_docs = [relevant_doc_1, relevant_doc_2]
# Model generates answer not grounded in retrieved docs
Evaluation must cover all three modes.
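Because the modes are distinct, it helps to triage failures before tuning anything. A minimal triage sketch, assuming each test query has labelled relevant doc IDs and the answer has already been judged wrong (the function name and thresholds are illustrative, not part of any library):
def diagnose_failure(retrieved_ids: list[str], relevant_ids: set[str], top_k: int = 5) -> str:
    """Rough triage of which failure mode a bad answer falls into."""
    if not any(doc_id in relevant_ids for doc_id in retrieved_ids):
        return "retrieval_failure"   # Mode 1: nothing relevant was retrieved
    if not any(doc_id in relevant_ids for doc_id in retrieved_ids[:top_k]):
        return "ranking_failure"     # Mode 2: a relevant doc was retrieved but ranked below the context cut-off
    return "generation_failure"      # Mode 3: relevant context was available; the answer still went wrong
Knowing which bucket dominates tells you whether to work on the retriever, the ranker, or the prompt and model.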
Part 1: Retrieval Metrics
Metric 1: Precision
Definition: What fraction of retrieved documents are relevant?
def precision_at_k(retrieved: list[str], relevant: set[str]) -> float:
"""
Precision@K: Fraction of retrieved docs that are relevant.
"""
retrieved_set = set(retrieved)
relevant_retrieved = retrieved_set & relevant
return len(relevant_retrieved) / len(retrieved) if retrieved else 0
# Example
retrieved = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
relevant = {'doc2', 'doc5', 'doc7'}
precision = precision_at_k(retrieved, relevant)
# 2 relevant out of 5 retrieved = 0.4
What it tells you:
- High precision: Few irrelevant results
- Low precision: Too much noise
When to optimize:
- Limited context window
- High cost per retrieved document
Metric 2: Recall
Definition: What fraction of relevant documents were retrieved?
def recall_at_k(retrieved: list[str], relevant: set[str]) -> float:
"""
Recall@K: Fraction of relevant docs that were retrieved.
"""
retrieved_set = set(retrieved)
relevant_retrieved = retrieved_set & relevant
return len(relevant_retrieved) / len(relevant) if relevant else 0
# Example
retrieved = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
relevant = {'doc2', 'doc5', 'doc7'}
recall = recall_at_k(retrieved, relevant)
# 2 relevant out of 3 total relevant = 0.67
What it tells you:
- High recall: Found most relevant documents
- Low recall: Missing important information
When to optimize:
- Completeness critical
- Can tolerate some noise
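Precision and recall usually pull in opposite directions as you retrieve more documents. A quick sweep over the earlier example, reusing precision_at_k and recall_at_k defined above, makes the trade-off concrete:
retrieved = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
relevant = {'doc2', 'doc5', 'doc7'}

for k in range(1, 6):
    top_k = retrieved[:k]
    print(f"k={k}  precision={precision_at_k(top_k, relevant):.2f}  "
          f"recall={recall_at_k(top_k, relevant):.2f}")
# k=1: precision 0.00, recall 0.00
# k=2: precision 0.50, recall 0.33
# k=5: precision 0.40, recall 0.67
Increasing k raises recall but tends to lower precision, which is exactly the trade-off the "when to optimize" guidance above is about.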
Metric 3: Mean Reciprocal Rank (MRR)
Definition: The average, across queries, of the reciprocal rank of the first relevant result.
def mean_reciprocal_rank(results: list[list[str]], relevant_sets: list[set[str]]) -> float:
"""
MRR: How high is the first relevant result ranked?
"""
reciprocal_ranks = []
for retrieved, relevant in zip(results, relevant_sets):
for rank, doc_id in enumerate(retrieved, start=1):
if doc_id in relevant:
reciprocal_ranks.append(1 / rank)
break
else:
reciprocal_ranks.append(0) # No relevant doc found
return sum(reciprocal_ranks) / len(reciprocal_ranks)
# Example
results = [
['doc1', 'doc2', 'doc3'], # First relevant at position 2
['doc4', 'doc5', 'doc6'], # First relevant at position 1
['doc7', 'doc8', 'doc9'] # No relevant doc
]
relevant_sets = [
{'doc2', 'doc5'},
{'doc4'},
{'doc10'}
]
mrr = mean_reciprocal_rank(results, relevant_sets)
# (1/2 + 1/1 + 0) / 3 = 0.5
What it tells you:
- High MRR: Relevant docs ranked high
- Low MRR: Relevant docs ranked low or missing
When to optimize:
- User experience depends on top results
- Top-k context window limit
Metric 4: Normalized Discounted Cumulative Gain (NDCG)
Definition: Measures ranking quality with graded relevance.
import numpy as np
def dcg_at_k(relevance_scores: list[float], k: int) -> float:
"""
DCG: Weighted sum of relevance scores.
Higher-ranked docs have more weight.
"""
relevance_scores = np.array(relevance_scores[:k])
if relevance_scores.size == 0:
return 0.0
# Discount by log of position
discounts = np.log2(np.arange(2, relevance_scores.size + 2))
return np.sum(relevance_scores / discounts)
def ndcg_at_k(relevance_scores: list[float], k: int) -> float:
"""
NDCG: DCG normalized by ideal DCG.
Score between 0 and 1.
"""
dcg = dcg_at_k(relevance_scores, k)
# Ideal: Sort by relevance (best possible ranking)
ideal_scores = sorted(relevance_scores, reverse=True)
idcg = dcg_at_k(ideal_scores, k)
return dcg / idcg if idcg > 0 else 0.0
# Example
# Query: "API authentication"
# Retrieved docs with relevance scores (3=perfect, 2=good, 1=partial, 0=irrelevant)
retrieved_relevance = [1, 3, 0, 2, 0] # Retrieved ranking
# The most relevant doc (score 3) sits at position 2 instead of position 1
ndcg = ndcg_at_k(retrieved_relevance, k=5)
# ≈ 0.79: below 1.0 because the best doc is ranked 2nd instead of 1st
What it tells you:
- NDCG=1.0: Perfect ranking
- Lower NDCG: Relevant docs ranked suboptimally
When to optimize:
- Multiple levels of relevance
- Ranking order matters
Part 2: Generation Metrics
Metric 1: Faithfulness (Answer Grounding)
Definition: Is the generated answer supported by retrieved documents?
class FaithfulnessEvaluator:
"""
Measure if answer is grounded in retrieved context.
"""
def evaluate(self, answer: str, retrieved_docs: list[str]) -> dict:
# Extract claims from answer
claims = self.extract_claims(answer)
# Check each claim against retrieved docs
results = {
'total_claims': len(claims),
'supported_claims': 0,
'unsupported_claims': []
}
for claim in claims:
if self.is_supported(claim, retrieved_docs):
results['supported_claims'] += 1
else:
results['unsupported_claims'].append(claim)
results['faithfulness_score'] = (
results['supported_claims'] / results['total_claims']
if results['total_claims'] > 0 else 0
)
return results
def extract_claims(self, answer: str) -> list[str]:
"""
Use LLM to break answer into atomic claims.
"""
prompt = f"""
Break this answer into individual factual claims.
Each claim should be a single, verifiable statement.
Answer: {answer}
Claims (one per line):
"""
claims_text = llm.generate(prompt)
return [c.strip() for c in claims_text.split('\n') if c.strip()]
def is_supported(self, claim: str, docs: list[str]) -> bool:
"""
Check if claim is supported by documents.
"""
context = '\n\n'.join(docs)
verification_prompt = f"""
Context:
{context}
Claim: {claim}
Is this claim directly supported by the context?
Answer only: YES or NO
"""
result = llm.generate(verification_prompt).strip().upper()
return result == 'YES'
# Example usage
answer = "The API rate limit is 1000 requests per hour, and it costs $0.01 per request."
retrieved_docs = ["API rate limit: 1000 req/hour"]
evaluator = FaithfulnessEvaluator()
faithfulness = evaluator.evaluate(answer, retrieved_docs)
# Result:
# {
# 'total_claims': 2,
# 'supported_claims': 1, # Rate limit claim supported
# 'unsupported_claims': ['The API costs $0.01 per request'], # Hallucinated
# 'faithfulness_score': 0.5
# }
What it tells you:
- Low faithfulness: Model hallucinating
- High faithfulness: Answers grounded in context
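The evaluators in this section call an llm.generate helper that is left undefined. One minimal way to wire it up, shown as a sketch assuming the OpenAI Python SDK (any chat-completion client with a similar interface would work, and the model name is just an example):
from openai import OpenAI

class SimpleLLM:
    """Thin wrapper so the evaluators can call llm.generate(prompt)."""

    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()  # Reads OPENAI_API_KEY from the environment
        self.model = model

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # Keep judge output as deterministic as possible
        )
        return response.choices[0].message.content

llm = SimpleLLM()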
Metric 2: Relevance (Answer Completeness)
Definition: Does the answer address the question?
class RelevanceEvaluator:
"""
Measure if answer actually addresses the question.
"""
def evaluate(self, query: str, answer: str) -> float:
prompt = f"""
Query: {query}
Answer: {answer}
Does this answer fully address the query?
Rate from 1-5:
5 = Completely addresses query
4 = Addresses query with minor gaps
3 = Partially addresses query
2 = Tangentially related
1 = Does not address query
Rating (1-5):
"""
rating = llm.generate(prompt).strip()
try:
return int(rating) / 5.0 # Normalize to 0-1
except ValueError:
return 0.0
# Example
query = "How do I authenticate with the API?"
answer = "The API uses OAuth 2.0. Contact support for credentials."
relevance = RelevanceEvaluator().evaluate(query, answer)
# Might score 3/5: Mentions auth method but lacks implementation details
Metric 3: Context Precision
Definition: What fraction of retrieved context is actually useful for answering?
def context_precision(query: str, answer: str, retrieved_docs: list[str]) -> float:
"""
How many retrieved docs were actually needed?
"""
useful_docs = 0
for doc in retrieved_docs:
# Check if this doc contributed to the answer
prompt = f"""
Query: {query}
Answer: {answer}
Document: {doc}
Was this document useful for generating the answer?
Answer: YES or NO
"""
result = llm.generate(prompt).strip().upper()
if result == 'YES':
useful_docs += 1
return useful_docs / len(retrieved_docs) if retrieved_docs else 0
# High context precision = retrieval is efficient
# Low context precision = retrieving too much noise
Metric 4: Context Recall
Definition: Did retrieval capture all information needed to answer?
def context_recall(query: str, ground_truth: str, retrieved_docs: list[str]) -> float:
"""
Can the ground truth answer be generated from retrieved docs?
"""
context = '\n\n'.join(retrieved_docs)
prompt = f"""
Query: {query}
Ground truth answer: {ground_truth}
Context:
{context}
What fraction of the ground truth answer can be supported by this context?
Answer with a number between 0.0 (none) and 1.0 (all).
Fraction:
"""
result = llm.generate(prompt).strip()
try:
return float(result)
except ValueError:
return 0.0
# High context recall = retrieval captured necessary info
# Low context recall = missing key documents
Part 3: Building an Evaluation Pipeline
Creating Test Cases
from pydantic import BaseModel

class RAGTestCase(BaseModel):
"""
Single evaluation test case.
"""
query: str
ground_truth_answer: str
relevant_doc_ids: set[str]
metadata: dict = {}
def create_test_suite() -> list[RAGTestCase]:
"""
Curate evaluation test cases.
"""
return [
RAGTestCase(
query="How do I reset my password?",
ground_truth_answer="Navigate to Settings > Security > Reset Password...",
relevant_doc_ids={'doc_123', 'doc_456'}
),
RAGTestCase(
query="What is the API rate limit?",
ground_truth_answer="1000 requests per hour for free tier",
relevant_doc_ids={'doc_789'}
),
# Add 50-100 diverse test cases
]
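Keeping the suite in a file, and under version control, makes it easier to grow and review over time. A sketch, assuming pydantic v2 (RAGTestCase above subclasses BaseModel) and a JSONL file path of your choosing:
from pathlib import Path

def save_test_suite(cases: list[RAGTestCase], path: str = "rag_eval_cases.jsonl"):
    """Write one JSON object per line so diffs stay readable."""
    with open(path, "w") as f:
        for case in cases:
            f.write(case.model_dump_json() + "\n")

def load_test_suite(path: str = "rag_eval_cases.jsonl") -> list[RAGTestCase]:
    return [
        RAGTestCase.model_validate_json(line)
        for line in Path(path).read_text().splitlines()
        if line.strip()
    ]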
Evaluation Runner
from statistics import mean

class RAGEvaluator:
"""
Run comprehensive evaluation on RAG system.
"""
def __init__(self, rag_system, test_cases: list[RAGTestCase]):
self.rag_system = rag_system
self.test_cases = test_cases
def evaluate_all(self) -> dict:
"""
Run full evaluation suite.
"""
results = {
'retrieval_metrics': self.evaluate_retrieval(),
'generation_metrics': self.evaluate_generation(),
'end_to_end_metrics': self.evaluate_end_to_end()
}
return results
def evaluate_retrieval(self) -> dict:
"""
Evaluate retrieval quality.
"""
precisions = []
recalls = []
mrrs = []
for case in self.test_cases:
retrieved = self.rag_system.retrieve(case.query, top_k=5)
retrieved_ids = [r['id'] for r in retrieved]
# Calculate metrics
precision = precision_at_k(retrieved_ids, case.relevant_doc_ids)
recall = recall_at_k(retrieved_ids, case.relevant_doc_ids)
precisions.append(precision)
recalls.append(recall)
# MRR
for rank, doc_id in enumerate(retrieved_ids, start=1):
if doc_id in case.relevant_doc_ids:
mrrs.append(1 / rank)
break
else:
mrrs.append(0)
return {
'precision@5': mean(precisions),
'recall@5': mean(recalls),
'mrr': mean(mrrs)
}
def evaluate_generation(self) -> dict:
"""
Evaluate generation quality.
"""
faithfulness_scores = []
relevance_scores = []
for case in self.test_cases:
# Get RAG system output
result = self.rag_system.query(case.query)
answer = result['answer']
retrieved_docs = [r['content'] for r in result['retrieved_docs']]
# Faithfulness
faith_eval = FaithfulnessEvaluator()
faith_result = faith_eval.evaluate(answer, retrieved_docs)
faithfulness_scores.append(faith_result['faithfulness_score'])
# Relevance
rel_eval = RelevanceEvaluator()
relevance_scores.append(rel_eval.evaluate(case.query, answer))
return {
'faithfulness': mean(faithfulness_scores),
'relevance': mean(relevance_scores)
}
def evaluate_end_to_end(self) -> dict:
"""
End-to-end evaluation comparing to ground truth.
"""
exact_matches = 0
semantic_similarities = []
for case in self.test_cases:
result = self.rag_system.query(case.query)
answer = result['answer']
# Exact match (rare but possible)
if answer.strip().lower() == case.ground_truth_answer.strip().lower():
exact_matches += 1
# Semantic similarity
similarity = compute_semantic_similarity(
answer,
case.ground_truth_answer
)
semantic_similarities.append(similarity)
return {
'exact_match': exact_matches / len(self.test_cases),
'semantic_similarity': mean(semantic_similarities)
}
# Run evaluation
evaluator = RAGEvaluator(rag_system, test_cases)
results = evaluator.evaluate_all()
print(f"Retrieval Precision: {results['retrieval_metrics']['precision@5']:.3f}")
print(f"Faithfulness: {results['generation_metrics']['faithfulness']:.3f}")
Part 4: Automated Evaluation with LLMs
Using LLM-as-Judge
import json

class LLMJudge:
"""
Use LLM to evaluate RAG quality.
Faster and cheaper than human evaluation.
"""
def evaluate_answer_quality(
self,
query: str,
answer: str,
retrieved_docs: list[str],
ground_truth: str = None
) -> dict:
"""
Multi-aspect LLM-based evaluation.
"""
context = '\n\n'.join(retrieved_docs)
eval_prompt = f"""
Evaluate this RAG system output.
Query: {query}
Retrieved Context:
{context}
Generated Answer: {answer}
{f'Ground Truth: {ground_truth}' if ground_truth else ''}
Evaluate on these dimensions (score 1-5 for each):
1. Faithfulness: Is the answer supported by the retrieved context?
2. Relevance: Does the answer address the query?
3. Completeness: Does the answer fully address the query?
4. Conciseness: Is the answer appropriately concise?
Provide scores and brief justification for each.
Output format (JSON):
{{
"faithfulness": {{"score": 1-5, "justification": "..."}},
"relevance": {{"score": 1-5, "justification": "..."}},
"completeness": {{"score": 1-5, "justification": "..."}},
"conciseness": {{"score": 1-5, "justification": "..."}}
}}
"""
result = llm.generate(eval_prompt)
return json.loads(result)
# Example usage
judge = LLMJudge()
scores = judge.evaluate_answer_quality(
query="How do I authenticate?",
answer="Use OAuth 2.0 with client credentials.",
retrieved_docs=["API uses OAuth 2.0..."],
ground_truth="Configure OAuth 2.0 with client ID and secret..."
)
print(f"Faithfulness: {scores['faithfulness']['score']}/5")
print(f"Relevance: {scores['relevance']['score']}/5")
Pairwise Comparison
def compare_rag_versions(query: str, answer_a: str, answer_b: str) -> str:
"""
Compare two RAG system versions.
Often more reliable than absolute scoring.
"""
prompt = f"""
Query: {query}
Answer A: {answer_a}
Answer B: {answer_b}
Which answer is better?
Consider:
- Accuracy
- Completeness
- Clarity
Output: "A", "B", or "TIE"
"""
return llm.generate(prompt).strip()
# A/B testing RAG improvements
wins = {'A': 0, 'B': 0, 'TIE': 0}
for case in test_cases:
    answer_a = rag_v1.query(case.query)['answer']
    answer_b = rag_v2.query(case.query)['answer']
    winner = compare_rag_versions(case.query, answer_a, answer_b)
    wins[winner if winner in wins else 'TIE'] += 1  # Guard against unexpected judge output
print(f"Version A: {wins['A']} wins")
print(f"Version B: {wins['B']} wins")
print(f"Ties: {wins['TIE']}")
Part 5: Production Monitoring
Real-Time Metrics
from datetime import datetime

class RAGMonitor:
"""
Track RAG performance in production.
"""
def __init__(self):
self.metrics_buffer = []
def log_query(
self,
query: str,
answer: str,
retrieved_docs: list[dict],
latency_ms: float,
user_feedback: str = None
):
"""
Log every RAG query for monitoring.
"""
metric = {
'timestamp': datetime.now(),
'query_length': len(query.split()),
'answer_length': len(answer.split()),
'num_docs_retrieved': len(retrieved_docs),
'latency_ms': latency_ms,
'user_feedback': user_feedback, # thumbs up/down
'retrieval_scores': [d['score'] for d in retrieved_docs]
}
self.metrics_buffer.append(metric)
# Alert on anomalies
self.check_anomalies(metric)
def check_anomalies(self, metric: dict):
"""
Detect potential issues.
"""
# High latency
if metric['latency_ms'] > 5000:
alert('High latency', metric)
# No documents retrieved
if metric['num_docs_retrieved'] == 0:
alert('Retrieval failure', metric)
# Low retrieval scores
if metric['retrieval_scores'] and max(metric['retrieval_scores']) < 0.5:
alert('Low relevance scores', metric)
def generate_report(self, time_window: str = '24h') -> dict:
"""
Aggregate metrics over time window.
"""
recent_metrics = filter_by_time(self.metrics_buffer, time_window)
return {
'total_queries': len(recent_metrics),
'avg_latency_ms': mean([m['latency_ms'] for m in recent_metrics]),
'avg_docs_retrieved': mean([m['num_docs_retrieved'] for m in recent_metrics]),
'positive_feedback_rate': sum(
1 for m in recent_metrics
if m['user_feedback'] == 'positive'
) / len(recent_metrics),
'zero_results_rate': sum(
1 for m in recent_metrics
if m['num_docs_retrieved'] == 0
) / len(recent_metrics)
}
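Hooking the monitor into the query path is a few lines per request. A usage sketch, assuming the rag_system.query interface used earlier in this article (the timing and feedback wiring will depend on your serving stack):
import time

monitor = RAGMonitor()

def answer_query(query: str) -> str:
    start = time.perf_counter()
    result = rag_system.query(query)
    latency_ms = (time.perf_counter() - start) * 1000

    monitor.log_query(
        query=query,
        answer=result['answer'],
        retrieved_docs=result['retrieved_docs'],
        latency_ms=latency_ms,
    )
    return result['answer']

# Later, e.g. on a schedule:
report = monitor.generate_report(time_window='24h')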
Regression Detection
class RegressionDetector:
"""
Detect when RAG quality degrades.
"""
def __init__(self, test_cases: list[RAGTestCase]):
self.test_cases = test_cases
self.baseline_metrics = None
def establish_baseline(self, rag_system):
"""
Run evaluation and save as baseline.
"""
evaluator = RAGEvaluator(rag_system, self.test_cases)
self.baseline_metrics = evaluator.evaluate_all()
def detect_regression(self, rag_system, threshold: float = 0.05):
"""
Check if current performance dropped significantly.
"""
evaluator = RAGEvaluator(rag_system, self.test_cases)
current_metrics = evaluator.evaluate_all()
regressions = []
# Compare retrieval metrics
for metric in ['precision@5', 'recall@5', 'mrr']:
baseline = self.baseline_metrics['retrieval_metrics'][metric]
current = current_metrics['retrieval_metrics'][metric]
delta = baseline - current
if delta > threshold:
regressions.append({
'metric': metric,
'baseline': baseline,
'current': current,
'delta': delta
})
# Compare generation metrics
for metric in ['faithfulness', 'relevance']:
baseline = self.baseline_metrics['generation_metrics'][metric]
current = current_metrics['generation_metrics'][metric]
delta = baseline - current
if delta > threshold:
regressions.append({
'metric': metric,
'baseline': baseline,
'current': current,
'delta': delta
})
return regressions
# Usage in CI/CD
detector = RegressionDetector(test_cases)
detector.establish_baseline(rag_system_v1)
# Before deploying v2
regressions = detector.detect_regression(rag_system_v2)
if regressions:
print("REGRESSION DETECTED:")
for reg in regressions:
print(f"{reg['metric']}: {reg['baseline']:.3f} → {reg['current']:.3f}")
raise Exception("Cannot deploy: performance regression")
Conclusion
Evaluation is not optional for production RAG.
Key practices:
- Measure retrieval separately: Precision, recall, MRR, NDCG
- Measure generation quality: Faithfulness, relevance, completeness
- Automate evaluation: Use LLM-as-judge for scalability
- Monitor continuously: Track metrics in production
- Detect regressions: Prevent quality degradation
Build evaluation infrastructure early. Without it, you are optimizing blindly.
The goal is not perfect scores. The goal is measurable improvement over time.