Building Eval Sets That Actually Catch Problems
— A good evaluation dataset isn't just random examples. It's a carefully curated collection that stresses your system where it's most likely to fail.
You can’t test an LLM system by running it once and checking the output. You need a dataset of examples that represent real-world usage, edge cases, and failure modes. But most teams build eval sets wrong: they grab random examples from production logs, maybe clean them up a bit, and call it done.
This catches some problems. It misses most others.
Why Random Samples Miss Problems
If you randomly sample 100 examples from production, you’ll get a distribution that matches your overall traffic. Sounds good, right? But most of your traffic represents the easy cases—the queries your system already handles well.
The problems hide in the tail: the 5% of queries that are ambiguous, adversarial, or edge cases. Random sampling gives you 5 of those in your eval set. That’s not enough to detect regressions reliably.
Worse, if a category of problematic query only appears in 0.5% of traffic, random sampling might miss it entirely. You won’t know your system fails on legal disclaimers until a user complains.
Stratified Sampling: Start Here
Instead of random sampling, stratify your data. Identify the meaningful categories in your use case and ensure each category is represented proportionally—or over-represented if it’s high-risk.
For a customer support chatbot, you might stratify by:
- Intent type (billing, technical support, account management)
- User sentiment (neutral, frustrated, angry)
- Query complexity (single question, multi-step, ambiguous)
- Domain-specific edge cases (refunds, cancellations, privacy requests)
Now you’re intentionally testing each category. If billing queries represent 10% of traffic but cause 40% of escalations, maybe they should be 25% of your eval set.
This catches more problems, but still misses the adversarial cases and rare edge cases.
Deliberate Edge Case Construction
Some failures won’t appear in production logs because users avoid them or because your system degrades silently. You need to deliberately construct examples that stress specific behaviors.
Ambiguous queries: Real users write unclear requests. Create examples with multiple valid interpretations and verify your system either picks the most likely one or asks clarifying questions.
Boundary testing: If your system extracts structured data, test the boundaries. What happens with 0 items? 100 items? Missing required fields? Malformed inputs?
Adversarial inputs: Users will try to break your system. Test prompt injections, attempts to extract system prompts, jailbreak techniques, and requests for inappropriate content.
Out-of-domain queries: Your system is trained for specific tasks. What happens when users ask off-topic questions? It should gracefully decline, not make up answers.
Multilingual and mixed-language inputs: Even if your system is English-only, users will input other languages. Test behavior with Spanish queries, code-switched text, or non-Latin scripts.
The Gold Standard Problem
Every eval set needs ground truth: the expected correct output for each input. But LLM outputs are rarely deterministic, so what does “correct” mean?
For factual tasks: The correct answer might be exact. “What’s the capital of France?” has one right answer. But “Summarize this document” has many acceptable outputs.
For open-ended tasks: You need rubrics, not exact strings. Define what makes a good summary (accuracy, conciseness, coverage of key points) and evaluate against those criteria.
For multi-step tasks: The exact path doesn’t matter, only the outcome. If an agent can solve a problem multiple ways, your eval should check whether it solved the problem, not whether it used specific steps.
Human Labeling at Scale
You can’t manually label thousands of examples, but you can bootstrap:
Start small: Manually label 50-100 examples per category. This gives you a seed set for initial testing.
Active learning: Run your system, identify examples where it’s uncertain or likely wrong, and prioritize those for human labeling. This concentrates effort where it matters most.
Partial automation: Use stronger models (GPT-4, Claude) to generate initial labels, then have humans review. This is faster than labeling from scratch and often catches errors the human might miss.
Crowdsourcing with validation: If you use crowdworkers, include gold-standard examples with known answers to verify labeler quality. Require multiple labelers per example and resolve disagreements.
Version Control for Eval Sets
Your eval set evolves as your system improves and as new failure modes emerge. Treat it like code:
Track changes: Use version control. When you add new examples or update labels, document why. This creates institutional memory about known issues.
Regression tests: When you fix a bug, add an example that would have caught it to your eval set. This prevents regressions.
Deprecate outdated examples: If you redesign your system, some eval examples might no longer be relevant. Remove them rather than cluttering the dataset.
Snapshot performance: Track your system’s performance on the same eval set over time. If accuracy suddenly drops, you have a regression. If it plateaus, you’ve saturated the eval set and need new challenges.
Category-Specific Eval Sets
Don’t use one giant eval set for everything. Build specialized sets for different purposes:
Smoke tests: A small set (20-50 examples) of critical functionality that must always work. Run this on every commit. If it fails, block deployment.
Comprehensive regression tests: A larger set (200-500 examples) covering all major categories. Run this before releases.
Adversarial tests: A dedicated set of challenging inputs designed to break your system. Run this periodically to identify vulnerabilities.
Performance benchmarks: A stable set that doesn’t change, used to track performance trends over time.
The Flywheel: Production Feedback Improves Evals
Your eval set should continuously improve based on production failures:
Monitor production: Flag outputs that users rate poorly, that trigger escalations, or that violate safety policies.
Root cause analysis: Don’t just add failed examples to your eval set—understand why they failed. What category of problem is this? What system behavior needs improvement?
Targeted expansion: Add examples to your eval set that would have caught the production failure. Now you’re testing for that failure mode in future iterations.
Close the loop: When you fix the underlying issue, verify that your new eval examples now pass. This confirms the fix and prevents future regressions.
What Good Looks Like
A well-constructed eval set:
- Covers all major use cases proportionally (or over-represents high-risk cases)
- Includes deliberate edge cases and adversarial examples
- Has clear, measurable success criteria for each example
- Evolves based on production failures and system changes
- Supports multiple levels of testing (smoke, regression, adversarial)
- Enables tracking performance trends over time
The goal isn’t a perfect dataset—those don’t exist for LLM systems. The goal is a dataset that catches meaningful problems before users encounter them, and that helps you understand where your system is improving or degrading.
Your eval set is your early warning system. Build it deliberately, maintain it carefully, and trust it to tell you when something’s wrong.