Testing LLMs Is Different (And Why Unit Tests Aren't Enough)

Traditional software testing assumes determinism. LLM outputs are probabilistic. Here’s why you need a completely different approach to quality control.

level: intermediate
topics: evaluation, testing, quality-control
tags: testing, evaluation, quality, production

When you write a traditional function that adds two numbers, you test it once with inputs 2 and 3, verify it returns 5, and you’re done. Run that test a thousand times and you’ll get 5 every time. This is the foundation of software testing: determinism.

LLMs break this model completely.

The Determinism Problem

Call an LLM with the same prompt today and tomorrow, and you might get different outputs. Set temperature to 0, and you’ll get more consistent outputs, but not identical ones. The model weights might update. The tokenization might change. The serving infrastructure might route to different instances.

Even ignoring external changes, LLMs are fundamentally probabilistic. They’re choosing from probability distributions, not executing deterministic logic. Your “test” might pass 95 times out of 100 and fail 5 times. Is that acceptable? It depends on your use case, and that’s not a question traditional testing frameworks can answer.
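One way to make that statistical nature concrete is to write tests that assert a pass rate rather than a single exact output. Here’s a minimal sketch; call_model is a hypothetical stand-in for your provider’s SDK, stubbed with canned variants so the example runs on its own:

```python
import random

def call_model(prompt: str) -> str:
    # Stand-in for a real LLM call; stubbed so the sketch runs on its own.
    return random.choice([
        "Paris is the capital of France.",
        "The capital of France is Paris.",
        "France's capital city is Paris.",
    ])

def test_capital_question_pass_rate():
    # Assert a pass *rate*, not a single exact output.
    n_runs = 100
    passes = sum(
        "paris" in call_model("What is the capital of France?").lower()
        for _ in range(n_runs)
    )
    # 0.95 is illustrative; pick a threshold that matches your risk tolerance.
    assert passes / n_runs >= 0.95
```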

What Traditional Tests Can’t Capture

Semantic equivalence: “The capital of France is Paris” and “Paris is France’s capital city” mean the same thing, but string equality checks fail. Traditional assertions like assertEqual don’t work when multiple correct answers exist.

Contextual appropriateness: An LLM might generate factually correct text that’s completely inappropriate for the context. A unit test can’t evaluate whether the tone matches your brand voice or whether the response addresses the user’s underlying intent.

Edge case coverage: With deterministic code, you can enumerate edge cases. With LLMs, the input space is infinite natural language. You can’t write unit tests for “all possible ways a user might ask about refund policies.”

Graceful degradation: When your code encounters an error, it crashes or returns null. When an LLM encounters ambiguity, it guesses. It might guess well or guess poorly, and a binary pass/fail test can’t measure the quality of that degradation.

What You Actually Need to Test

Instead of testing for exact outputs, you need to test for properties and behaviors:

Correctness: Does the output contain accurate information? This isn’t a string match—it’s semantic evaluation. You need either human evaluators or another LLM to judge whether the information is correct.
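A common way to automate this is the LLM-as-judge pattern: ask a second model to grade the first one’s answer against a reference. A rough sketch, where complete() is a hypothetical stand-in for a real client and the judge prompt is deliberately simple:

```python
JUDGE_PROMPT = """You are grading an answer for factual correctness.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def complete(prompt: str) -> str:
    # Stand-in for a real LLM call; stubbed so the sketch runs on its own.
    return "CORRECT"

def judge_correctness(question: str, reference: str, candidate: str) -> bool:
    verdict = complete(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    ))
    return verdict.strip().upper().startswith("CORRECT")

# Example: a semantically equivalent answer that a string comparison would reject.
print(judge_correctness(
    "What is the capital of France?",
    "The capital of France is Paris.",
    "Paris is France's capital city.",
))
```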

Coherence: Is the output logically consistent? Does it contradict itself? Does it stay on topic?

Safety: Does the output avoid generating harmful, biased, or inappropriate content? This requires testing against adversarial inputs, not just happy paths.

Consistency: When given similar inputs, does the system produce similar outputs? You’re not looking for identical text, but consistent structure, tone, and factual claims.

Robustness: How does the system behave with typos, unusual phrasing, or out-of-distribution inputs? Traditional software either works or throws an error. LLMs degrade gradually, and you need to understand that degradation curve.
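You can probe that degradation curve directly: perturb inputs you already handle well and compare pass rates, rather than expecting a hard failure. A rough sketch with a stubbed model call and a crude typo generator:

```python
import random

def call_model(prompt: str) -> str:
    # Stand-in for a real LLM call; stubbed so the sketch runs on its own.
    return "You can request a refund within 30 days of purchase."

def add_typos(text: str, rate: float = 0.1) -> str:
    # Crude adjacent-character swaps; real robustness suites use richer perturbations.
    chars = list(text)
    for i in range(len(chars) - 1):
        if random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def pass_rate(prompts, check, runs_per_prompt=20):
    results = [check(call_model(p)) for p in prompts for _ in range(runs_per_prompt)]
    return sum(results) / len(results)

clean = ["What is your refund policy?"]
noisy = [add_typos(p) for p in clean]
check = lambda output: "refund" in output.lower()

print("clean:", pass_rate(clean, check), "noisy:", pass_rate(noisy, check))
```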

The Evaluation Pipeline You Actually Build

Testing LLMs isn’t a single step—it’s a pipeline with different validation layers:

Pre-deployment evaluation: Before you ship, you run your prompts against a curated test set. This isn’t pass/fail unit testing—it’s statistical validation. You’re measuring aggregate metrics: what percentage of outputs meet quality bars? You might accept 95% accuracy, not 100%.
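In practice this is a small harness: run the test set, score each output with whichever checks you trust, and gate on the aggregate. A sketch with the model call and the scoring function stubbed out:

```python
def call_model(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return "Refunds are available within 30 days of purchase."

def meets_quality_bar(output: str, case: dict) -> bool:
    # Stand-in for real checks: format validation, LLM judges, similarity, etc.
    return all(phrase in output.lower() for phrase in case["must_mention"])

TEST_SET = [
    {"prompt": "What is your refund policy?", "must_mention": ["refund", "30 days"]},
    {"prompt": "How long do refunds take?", "must_mention": ["refund"]},
]

results = [meets_quality_bar(call_model(c["prompt"]), c) for c in TEST_SET]
accuracy = sum(results) / len(results)
print(f"accuracy: {accuracy:.0%}")

# Gate on an aggregate metric, not on any single case.
assert accuracy >= 0.95, "Below the quality bar; don't ship."
```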

A/B testing in production: You can’t test all edge cases before deployment. Instead, you deploy changes to a small percentage of traffic, measure real-world performance, and roll back if metrics degrade. This is less like unit testing and more like gradual rollouts with monitoring.
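The routing itself can be as simple as stable hash-based bucketing, so the same user always sees the same variant while only a small slice of traffic sees the change. A sketch, with illustrative variant names:

```python
import hashlib

def variant_for(user_id: str, rollout_fraction: float = 0.05) -> str:
    # Deterministic bucketing: the same user always lands in the same bucket.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "candidate" if bucket < rollout_fraction * 10_000 else "control"

# Example: route roughly 5% of users to the new prompt and log which variant they saw.
print(variant_for("user-1234"))
```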

Continuous evaluation: Even after full deployment, you continuously sample outputs and evaluate them. LLM providers update models, user behavior shifts, and edge cases emerge. Quality control is ongoing, not a one-time gate.

Human review loops: Some percentage of outputs need human evaluation. You can’t automate all quality assessment. Budget for human reviewers, build tools to make review efficient, and use their feedback to improve automated evaluation.

Assertion Patterns That Work

When you do write automated checks, they look different from traditional tests:

Format validation: Check that JSON outputs parse correctly, required fields exist, and types match expectations. This is deterministic and works like traditional testing.

Constraint verification: Verify that outputs respect hard constraints. If you asked for a 3-item list, check that you got 3 items. If you required PG-rated content, check that profanity filters pass.
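These two categories look closest to ordinary unit tests. A sketch covering both, assuming a prompt that asked for a JSON object with exactly three items:

```python
import json

raw_output = '{"items": ["Check order status", "Request a refund", "Contact support"]}'

# Format validation: does it parse, with the fields and types we require?
data = json.loads(raw_output)
assert isinstance(data, dict)
assert isinstance(data.get("items"), list)
assert all(isinstance(item, str) for item in data["items"])

# Constraint verification: hard requirements from the prompt.
assert len(data["items"]) == 3, "asked for exactly 3 items"

BLOCKLIST = {"damn", "hell"}  # illustrative; use a real content filter in practice
assert not any(word in item.lower() for item in data["items"] for word in BLOCKLIST)
```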

Semantic similarity: Use embedding models to verify that outputs are semantically similar to expected responses. You’re not checking for exact matches—you’re checking that outputs cluster near acceptable responses in embedding space.
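A sketch of that check, assuming the sentence-transformers package and a small off-the-shelf embedding model; any embedding API works the same way:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model

expected = "The capital of France is Paris."
actual = "Paris is France's capital city."

embeddings = model.encode([expected, actual])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

# The threshold is empirical: calibrate it against outputs you've already labeled.
assert similarity >= 0.8, f"output drifted from the expected response ({similarity:.2f})"
```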

Regression detection: Store previous outputs for a test set, and flag when new outputs diverge significantly. You’re not blocking deployment—you’re alerting humans to review changes.
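A minimal version stores the previous run’s outputs and flags large textual drift for review; difflib from the standard library is enough for a first pass, though the embedding similarity above is usually a better drift measure:

```python
import json
from difflib import SequenceMatcher
from pathlib import Path

BASELINE_PATH = Path("baseline_outputs.json")  # versioned alongside the test set

def call_model(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return "You can return items within 30 days for a full refund."

prompts = ["What is your refund policy?"]
new_outputs = {p: call_model(p) for p in prompts}

if BASELINE_PATH.exists():
    baseline = json.loads(BASELINE_PATH.read_text())
    for prompt, new in new_outputs.items():
        drift = 1 - SequenceMatcher(None, baseline.get(prompt, ""), new).ratio()
        if drift > 0.3:  # illustrative threshold
            print(f"REVIEW: output drifted {drift:.0%} for prompt {prompt!r}")
else:
    BASELINE_PATH.write_text(json.dumps(new_outputs, indent=2))
    print("No baseline yet; stored current outputs as the baseline.")
```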

The Mental Model Shift

Traditional testing asks: “Does this code do exactly what I expect?”

LLM testing asks: “Does this system usually produce acceptable outputs for my use case?”

The first question is binary. The second is statistical.

This means your testing infrastructure needs to support:

  • Metrics and thresholds instead of pass/fail assertions
  • Human-in-the-loop review instead of fully automated pipelines
  • Continuous monitoring instead of pre-deployment gates
  • Statistical sampling instead of exhaustive coverage

What This Means for Your Team

If you’re used to 100% test coverage and green CI pipelines, LLM development feels uncomfortable. You’re shipping code you can’t fully predict. You’re accepting some failure rate as normal.

This isn’t sloppiness—it’s the nature of probabilistic systems. The goal isn’t to eliminate all errors. The goal is to understand your error distribution, set acceptable thresholds, and monitor for drift.

Your testing strategy becomes a risk management strategy: What failure modes matter most? What metrics predict user satisfaction? How do you detect problems before they scale?

Different questions require different tools, and that’s what the rest of this learning path covers.