LLM-as-Judge: When It Works and When It Fails
Using one LLM to evaluate another sounds circular, but it's one of the most practical ways to scale quality assessment. Here's when it's reliable and when you need humans.
You’ve built an LLM system. Now you need to evaluate its outputs. You could hire humans to review every response, but that’s slow and expensive. You could write deterministic checks, but those only catch format errors, not semantic quality.
So you use another LLM to evaluate the first LLM’s outputs. This is called “LLM-as-judge,” and it’s simultaneously one of the most useful and most misused techniques in AI engineering.
Why This Isn’t Circular
The obvious objection: if LLMs make mistakes, how can an LLM reliably evaluate another LLM?
Two reasons this works:
Evaluation is easier than generation: Judging whether an answer is correct is cognitively simpler than generating the answer. Humans do this all the time—multiple choice tests are easier than essays. LLMs show the same pattern: models that struggle to generate good summaries can often judge whether someone else’s summary is good.
You can use stronger models to evaluate weaker ones: Your production system might use a fast, cheap model (GPT-3.5, Claude Haiku) for cost and latency reasons. Your evaluator can use a slower, more capable model (GPT-4, Claude Opus) because it only runs on a sample of outputs, not every request.
What LLM-as-Judge Is Good At
Factual accuracy: Given a reference answer and a generated answer, an LLM judge can assess whether the generated answer contains the same key facts. This works well for closed-domain questions with verifiable answers.
Instruction following: If your prompt asks for a specific format or constraints (3 bullet points, no speculation, formal tone), an LLM judge can check compliance better than regex or rule-based validation.
Semantic similarity: Checking whether two texts mean the same thing, even if phrased differently. Traditional NLP metrics like BLEU or ROUGE measure word overlap; LLM judges understand meaning.
Rubric-based evaluation: If you can define clear evaluation criteria (relevance, coherence, completeness), you can give the judge LLM those criteria and ask it to score each dimension. This lets you scale a judgment process that would otherwise require human reviewers.
Comparative evaluation: Instead of asking “Is this output good?”, ask “Which of these two outputs is better?” Pairwise comparisons are more reliable than absolute scoring, and they enable Elo-style ranking systems.
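To make the pairwise idea concrete, here is a minimal sketch of a pairwise judge call. It assumes the OpenAI Python client (v1.x); the model name, prompt wording, and the pairwise_judge helper are placeholders for illustration, not a recommended setup.

```python
# Minimal pairwise-judge sketch. Assumes the OpenAI Python client (>=1.0);
# the model name and rubric wording are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PAIRWISE_PROMPT = """You are comparing two answers to the same question.

Question:
{question}

Answer A:
{answer_a}

Answer B:
{answer_b}

Which answer is more accurate, complete, and faithful to the question?
Reply with exactly one letter: "A" or "B"."""


def pairwise_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Return 'A' or 'B' according to the judge model."""
    prompt = PAIRWISE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    response = client.chat.completions.create(
        model="gpt-4o",          # assumed judge model; swap in your own
        temperature=0,           # deterministic judgments
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return "A" if verdict.startswith("A") else "B"
```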
What LLM-as-Judge Is Bad At
Subjective preferences: Brand voice, tone, style—these are human preferences that vary by context and audience. An LLM judge can check for consistency with examples, but it can’t reliably capture subtle brand distinctions.
Rare or specialized domains: If your LLM system handles medical advice, legal analysis, or specialized technical domains, the judge LLM might not have the domain knowledge to evaluate correctness. You need human experts.
Adversarial robustness: LLM judges can be fooled by confident-sounding but incorrect outputs. If your production system generates plausible-but-wrong information, the judge might rate it highly because it’s coherent and well-structured.
Edge cases and safety violations: Subtle forms of bias, inappropriate implications, or carefully crafted jailbreak attempts might slip past an automated judge. These require human review.
Creativity and novelty: If you’re generating creative content, an LLM judge might penalize unusual or innovative outputs that don’t match common patterns. Human judgment is essential for creative domains.
How to Structure Judge Prompts
The quality of your evaluation depends heavily on how you prompt the judge LLM. Vague prompts produce unreliable scores.
Provide clear criteria: Don’t ask “Is this summary good?” Instead: “Evaluate this summary on three criteria: (1) Does it capture the key points from the source? (2) Is it concise? (3) Is it free of factual errors? Score each criterion 1-5 and explain your reasoning.”
Include examples: Show the judge what good and bad outputs look like. This calibrates the model’s judgment and reduces variance in scoring.
Ask for reasoning before scoring: Chain-of-thought prompting works for evaluation too. Ask the judge to explain its reasoning first, then provide a score. This improves reliability and gives you debuggable explanations.
Use structured output: Ask for scores in JSON format so you can programmatically process them. This makes it easier to aggregate results and track trends.
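Pulling these four suggestions together, here is a hedged sketch of a rubric-based judge call: explicit criteria, reasoning before scores, and structured JSON output. The rubric, scale, model name, and the response_format option assume an OpenAI-style API and are illustrative only.

```python
# Sketch of a rubric-based judge prompt that asks for reasoning first, then
# JSON scores. The criteria, scale, and model name are illustrative only.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Evaluate the summary below against its source document.

Criteria (score each 1-5):
1. coverage  - does it capture the key points of the source?
2. concision - is it free of filler and repetition?
3. accuracy  - is it free of factual errors or unsupported claims?

First write a short "reasoning" paragraph, then output scores as JSON:
{{"reasoning": "...", "coverage": <1-5>, "concision": <1-5>, "accuracy": <1-5>}}

Source document:
{source}

Summary to evaluate:
{summary}

Respond with the JSON object only."""


def judge_summary(source: str, summary: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        temperature=0,
        response_format={"type": "json_object"},  # request valid JSON back
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(source=source, summary=summary),
        }],
    )
    return json.loads(response.choices[0].message.content)
```

Note that the reasoning lives inside the JSON object, ahead of the scores, so you keep both machine-readable results and a debuggable explanation.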
Bias in LLM Judges
LLM judges aren’t neutral. They have systematic biases:
Position bias: When evaluating multiple outputs, judges often favor the first option (primacy bias) or the last option (recency bias). Randomize the order of outputs across evaluation runs.
Verbosity bias: Longer outputs often score higher, even when they’re not better. Control for this by including examples where the shorter output is correct.
Style bias: Judges favor outputs stylistically similar to their training data. If your system generates informal text but the judge was trained on formal corpora, scores might be artificially low.
Self-preference bias: Models judge their own outputs more favorably than others’ outputs. If you’re using GPT-4 to evaluate GPT-4, be aware of this bias and consider using a different model as judge.
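Position bias in particular is cheap to control for. A small sketch, reusing the hypothetical pairwise_judge helper from earlier: run each comparison twice with the answers swapped and only trust verdicts that survive the swap.

```python
# Position-bias control for the pairwise judge sketched earlier: run each
# comparison twice with the answers swapped and keep only consistent verdicts.
def debiased_pairwise(question: str, answer_a: str, answer_b: str) -> str | None:
    first = pairwise_judge(question, answer_a, answer_b)    # A vs B
    second = pairwise_judge(question, answer_b, answer_a)   # B vs A (swapped)

    # Map the swapped verdict back to the original labels.
    second_unswapped = "B" if second == "A" else "A"

    if first == second_unswapped:
        return first    # consistent verdict in both orders
    return None         # order-dependent verdict: treat as a tie or escalate
```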
Validating Your Judge
Before trusting an LLM judge, validate it against human judgment:
Correlation testing: Have humans evaluate a sample of outputs. Calculate correlation between human scores and LLM judge scores. Strong correlation (0.7+) suggests the judge is reliable.
Agreement rate: For binary judgments (acceptable vs. unacceptable), measure how often the LLM agrees with humans. 85%+ agreement is a good target.
Error analysis: Review cases where the LLM judge disagrees with humans. Are there systematic patterns, such as error types the judge consistently misses? Use these insights to refine judge prompts.
Calibration: Check whether the judge’s confidence matches reality. If it assigns 90% scores, do 90% of those outputs actually meet the quality bar when humans review them?
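These checks are straightforward to script once you have paired human and judge labels for the same sample. Below is a small sketch using scipy for rank correlation; the 0.7 and 85% thresholds simply mirror the rules of thumb above and should be tuned to your own risk tolerance.

```python
# Sketch of judge validation against human labels. Uses scipy for Spearman
# correlation; the 0.7 / 0.85 thresholds mirror the rules of thumb in the text.
from scipy.stats import spearmanr


def validate_judge(human_scores: list[float], judge_scores: list[float],
                   human_pass: list[bool], judge_pass: list[bool]) -> dict:
    corr, _ = spearmanr(human_scores, judge_scores)
    agreement = sum(h == j for h, j in zip(human_pass, judge_pass)) / len(human_pass)
    return {
        "spearman_correlation": corr,      # want roughly 0.7 or higher
        "binary_agreement": agreement,     # want roughly 0.85 or higher
        "judge_is_trustworthy": corr >= 0.7 and agreement >= 0.85,
    }
```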
Multi-Judge Ensembles
Using multiple LLM judges and aggregating their scores can improve reliability:
Different models: Combine judgments from GPT-4, Claude, and Gemini. Each has different biases; averaging reduces systematic error.
Different prompts: Evaluate the same output with different rubrics or framing. This catches issues that one perspective might miss.
Majority voting: For binary decisions, use multiple judges and take the majority vote. This reduces variance from any single judge’s mistakes.
Weighted aggregation: If you’ve validated judges against human data, weight more reliable judges more heavily in the final score.
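A minimal sketch of the last two aggregation strategies, with placeholder judge names and weights; in practice the weights would come from your own validation data, not the numbers shown here.

```python
# Ensemble aggregation sketch: majority vote for binary verdicts and a
# reliability-weighted average for numeric scores. Judge names and weights
# are placeholders; derive real weights from validation against humans.
from collections import Counter


def majority_vote(verdicts: list[bool]) -> bool:
    """True if more than half of the judges accept the output."""
    return Counter(verdicts)[True] > len(verdicts) / 2


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-judge scores, e.g. {'gpt-4o': 4.0, 'claude': 3.5}."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight
```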
When to Escalate to Humans
LLM judges should be part of a multi-tier system:
First tier—automated filtering: LLM judges evaluate all outputs. Flag those that fail clear quality bars for human review.
Second tier—sampling: Humans review a random sample of outputs that passed automated checks, to catch systematic errors the judge misses.
Third tier—active learning: Prioritize human review for cases where the LLM judge is uncertain (mid-range scores) or where multiple judges disagree.
Fourth tier—incident response: When users report problems, human experts investigate and determine whether it’s a one-off mistake or a systematic issue.
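One way this tiering might look in code, with purely illustrative thresholds: send outputs that fail the quality bar, sit near it, or split the judges to humans, and let the rest flow into random sampling.

```python
# Tiered routing sketch. Thresholds and return labels are illustrative; the
# point is to escalate low or ambiguous judge results to humans.
def route_for_review(judge_scores: list[float], pass_threshold: float = 4.0,
                     uncertainty_band: float = 0.5) -> str:
    mean_score = sum(judge_scores) / len(judge_scores)
    spread = max(judge_scores) - min(judge_scores)

    if mean_score < pass_threshold:
        return "human_review"            # tier 1: failed the automated bar
    if abs(mean_score - pass_threshold) < uncertainty_band or spread > 1.0:
        return "priority_human_review"   # tier 3: uncertain, or judges disagree
    return "auto_pass"                   # eligible for tier 2 random sampling
```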
Practical Implementation Pattern
Here’s how LLM-as-judge fits into a realistic evaluation pipeline:
Development: Use LLM judges to quickly iterate on prompts and system designs. Fast feedback loops enable rapid experimentation.
Pre-deployment: Run comprehensive evals with LLM judges on a large test set. Ensure quality metrics meet your thresholds before shipping.
Production monitoring: Sample a small percentage of production outputs (1-5%) and evaluate with LLM judges. Alert if quality metrics degrade.
Continuous improvement: Periodically have humans review a sample of LLM-judged outputs. Use disagreements to refine judge prompts or escalate to human review tiers.
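A rough sketch of the production-monitoring step, assuming hypothetical judge_output and send_alert helpers; the sample rate, window size, and alert threshold are placeholders for whatever your stack provides.

```python
# Production-monitoring sketch: judge a small random sample of live outputs
# and alert when the rolling pass rate drops. Sample rate, threshold, and the
# alert hook are placeholders.
import random

SAMPLE_RATE = 0.02        # judge ~2% of production traffic
ALERT_THRESHOLD = 0.90    # alert if fewer than 90% of sampled outputs pass


def maybe_evaluate(output: str, recent_results: list[bool]) -> None:
    if random.random() > SAMPLE_RATE:
        return
    passed = judge_output(output)     # hypothetical wrapper around your judge
    recent_results.append(passed)
    window = recent_results[-200:]    # rolling window of recent samples
    if len(window) >= 50 and sum(window) / len(window) < ALERT_THRESHOLD:
        send_alert(f"Judge pass rate dropped to {sum(window)/len(window):.0%}")  # hypothetical alert hook
```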
The Right Mental Model
Think of LLM-as-judge not as a replacement for human evaluation, but as a scalable approximation. It automates the easy 95% of quality assessment, freeing humans to focus on edge cases, adversarial inputs, and strategic decisions.
It’s reliable when:
- Evaluation criteria are clear and objective
- You’re evaluating factual accuracy or instruction-following
- You’ve validated the judge against human ground truth
- You’re using comparative evaluation rather than absolute scoring
It’s unreliable when:
- Criteria are subjective or domain-specific
- You’re evaluating safety, bias, or adversarial robustness
- You need to catch rare edge cases
- You’re evaluating creative or novel outputs
Use it where it works, validate it rigorously, and always have a human review tier for the cases that matter most.