Metrics That Actually Predict User Satisfaction
You can measure accuracy, latency, and token costs easily. But the metrics that matter most are the ones that correlate with whether users find your AI system valuable.
Your LLM system achieves 94% accuracy on your eval set. Is that good? It depends: good at what? Accuracy measures whether outputs match expected answers, but users don’t care about eval sets. They care about whether your system solves their problems.
The hard part of LLM evaluation isn’t measuring what’s easy to measure—it’s identifying which metrics actually predict whether users will be satisfied, return, and recommend your product.
The Proxy Metric Trap
Most technical metrics are proxies. They’re easier to measure than the thing you actually care about, so you measure them and hope they correlate.
Accuracy: Measures whether outputs match ground truth labels. But users might be satisfied with outputs that don’t match your labels, or dissatisfied with technically correct responses that don’t address their needs.
Latency: Measures response time. Important, but diminishing returns—improving from 10s to 5s matters more than improving from 2s to 1s. Users won’t notice the latter, but you’ll pay more for faster infrastructure.
Token costs: Measure how much you’re spending. Critical for unit economics, but optimizing for cost alone often degrades quality. Users don’t care what it costs you; they care whether it works.
These metrics matter, but they’re not the goal. They’re inputs to the goal: building something users find valuable.
Task Success Rate: The North Star
The most important question: did the user accomplish what they wanted?
For a customer support chatbot, task success means the user got their question answered without escalating to a human. For a code completion tool, it means the user accepted the suggestion. For a writing assistant, it means the user kept the generated text.
This is hard to measure automatically because you often need to infer user intent and then observe whether they achieved it. But it’s worth the investment because it directly measures value delivered.
Explicit signals: If your interface has thumbs up/down buttons, acceptance clicks, or completion confirmations, these are direct task success signals.
Implicit signals: User behavior indicates satisfaction even without explicit feedback. Did they edit the output minimally or rewrite it completely? Did they immediately rephrase their query (suggesting the first response failed)? Did they abandon the session?
Follow-up actions: In some domains, you can measure downstream success. If your AI generates code, did it pass tests and get committed? If it writes emails, did recipients respond? If it suggests products, did users purchase?
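To make this concrete, here is a minimal sketch of turning those signals into a per-session success estimate. The field names and the 40% edit-ratio threshold are illustrative assumptions; swap in whatever your logging actually captures and calibrate the thresholds against labeled sessions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Session:
    """One user session, reconstructed from event logs (fields are illustrative)."""
    thumbs: Optional[bool]        # explicit thumbs up/down, if the user clicked one
    edit_ratio: Optional[float]   # fraction of generated text the user rewrote (0.0-1.0)
    rephrased_immediately: bool   # user re-asked essentially the same question right away
    abandoned: bool               # user left mid-task

def task_succeeded(s: Session) -> Optional[bool]:
    """Infer task success for one session; None means 'not enough signal'."""
    # Explicit feedback wins when we have it.
    if s.thumbs is not None:
        return s.thumbs
    # Strong implicit negative signals.
    if s.abandoned or s.rephrased_immediately:
        return False
    # Light edits suggest the output was kept; heavy edits suggest it was rewritten.
    if s.edit_ratio is not None:
        return s.edit_ratio < 0.4  # threshold is an assumption; tune against labeled data
    return None

def task_success_rate(sessions: list[Session]) -> float:
    """Share of sessions with an inferable outcome that look successful."""
    outcomes = [task_succeeded(s) for s in sessions]
    decided = [o for o in outcomes if o is not None]
    return sum(decided) / len(decided) if decided else float("nan")

if __name__ == "__main__":
    sample = [
        Session(thumbs=True, edit_ratio=None, rephrased_immediately=False, abandoned=False),
        Session(thumbs=None, edit_ratio=0.1, rephrased_immediately=False, abandoned=False),
        Session(thumbs=None, edit_ratio=0.9, rephrased_immediately=False, abandoned=False),
        Session(thumbs=None, edit_ratio=None, rephrased_immediately=True, abandoned=False),
    ]
    print(f"task success rate: {task_success_rate(sample):.0%}")  # 50%
```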
Leading vs. Lagging Indicators
User satisfaction is a lagging indicator—you only know it after users interact with your system for a while. Leading indicators predict satisfaction before you’ve accumulated enough usage data.
Instruction following rate: If your system consistently does what users ask, satisfaction will likely be high. Measure how often outputs comply with explicit user constraints.
Uncertainty and hedging: When LLMs don’t know, they hedge (“This might be…”) or refuse to answer. Excessive hedging signals low confidence, which users perceive as unreliable even if the hedged answer is technically correct.
Coherence and relevance: Outputs that wander off-topic or contradict themselves frustrate users. These issues are detectable with automated checks and correlate with poor user ratings.
Error recovery: Users make typos, phrase questions poorly, or change their mind mid-conversation. Systems that gracefully handle these situations score higher on satisfaction than systems that fail on edge cases.
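Two of these indicators, hedging rate and instruction following rate, are cheap to approximate from logs. The sketch below is a rough version: the hedge phrases are examples to extend from your own transcripts, and the constraint checks cover only a machine-checkable subset (length limits, required strings); anything subtler usually needs per-feature logic or an LLM judge.

```python
import re

# Illustrative hedge phrases; extend this list from your own transcripts.
HEDGE_PATTERNS = [
    r"\bthis might be\b",
    r"\bI'm not sure\b",
    r"\bit(?:'s| is) (?:hard|difficult) to say\b",
    r"\bI can(?:not|'t) (?:answer|help with)\b",
]

def hedging_rate(responses: list[str]) -> float:
    """Fraction of responses that contain at least one hedge phrase."""
    if not responses:
        return 0.0
    hedged = sum(
        1 for r in responses
        if any(re.search(p, r, re.IGNORECASE) for p in HEDGE_PATTERNS)
    )
    return hedged / len(responses)

def follows_constraints(response: str, constraints: dict) -> bool:
    """Check only the machine-checkable constraints the user stated explicitly."""
    if "max_words" in constraints and len(response.split()) > constraints["max_words"]:
        return False
    if "must_include" in constraints and constraints["must_include"].lower() not in response.lower():
        return False
    return True

def instruction_following_rate(pairs: list[tuple[str, dict]]) -> float:
    """Share of (response, constraints) pairs where every checkable constraint holds."""
    if not pairs:
        return 0.0
    return sum(follows_constraints(r, c) for r, c in pairs) / len(pairs)
```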
Measuring What Breaks Trust
Users forgive many imperfections, but certain failures break trust and cause churn:
Hallucinations: Confident-sounding false information is worse than admitting uncertainty. Track hallucination rates aggressively, especially in domains where accuracy is critical (medical, legal, financial).
Inconsistency: If users ask the same question twice and get contradictory answers, trust erodes. Measure response variance for similar queries.
Bias and inappropriate content: Even one bad experience—offensive output, stereotyped response, or leaked PII—can permanently damage trust. These failures might be rare (0.1% of requests) but catastrophic. You need 100% coverage testing for safety, not statistical sampling.
Silent failures: Systems that generate plausible but incorrect information are worse than systems that obviously break. Users won’t notice the error until damage is done. Track silent failure rates through spot-checking and user-reported issues.
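Inconsistency in particular is easy to quantify once you group logged queries that are effectively the same and compare the answers. The sketch below uses exact matching on a normalized query and a plain string-similarity ratio from the standard library; a production version would likely use embeddings and a semantic comparison, but the shape of the metric is the same.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def normalize(query: str) -> str:
    """Crude query canonicalization: lowercase and collapse whitespace."""
    return " ".join(query.lower().split())

def answer_consistency(logs: list[tuple[str, str]]) -> float:
    """
    Average pairwise similarity of answers given to the 'same' query.
    logs is a list of (query, answer) pairs; 1.0 means perfectly consistent.
    """
    by_query: dict[str, list[str]] = defaultdict(list)
    for query, answer in logs:
        by_query[normalize(query)].append(answer)

    similarities = []
    for answers in by_query.values():
        for a, b in combinations(answers, 2):
            similarities.append(SequenceMatcher(None, a, b).ratio())
    return sum(similarities) / len(similarities) if similarities else 1.0
```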
Engagement Metrics as Satisfaction Proxies
If users are satisfied, they engage more:
Return rate: Do users come back after their first session? High churn after first use suggests the system didn’t deliver value.
Session depth: How many interactions do users have per session? Deeper sessions suggest users find the system useful enough to continue using it.
Feature adoption: If you add new capabilities, do users discover and use them? Low adoption might mean features aren’t valuable or users don’t understand they exist.
Retention cohorts: Track cohorts over time. Do Week 1 users still use your system in Week 8? Retention curves reveal long-term satisfaction.
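If your analytics store can produce (user, week of first use, week of activity) tuples, the retention curve is a few lines of bookkeeping. A minimal sketch, with illustrative field names:

```python
from collections import defaultdict

def retention_curve(events: list[tuple[str, int, int]]) -> dict[int, dict[int, float]]:
    """
    events: (user_id, cohort_week, active_week) tuples, one per user-week of activity.
    Returns {cohort_week: {weeks_since_first_use: fraction_of_cohort_active}}.
    """
    cohort_users: dict[int, set] = defaultdict(set)        # cohort -> every user who started that week
    active: dict[tuple[int, int], set] = defaultdict(set)  # (cohort, weeks since start) -> active users

    for user, cohort, week in events:
        cohort_users[cohort].add(user)
        active[(cohort, week - cohort)].add(user)

    return {
        cohort: {
            offset: len(active[(cohort, offset)]) / len(users)
            for (c, offset) in sorted(active)
            if c == cohort
        }
        for cohort, users in cohort_users.items()
    }

# Example: are Week 1 signups still active seven weeks later?
# curve = retention_curve(events); week1_in_week8 = curve[1].get(7, 0.0)
```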
The Human Review Layer
No automated metric fully captures user satisfaction. You need qualitative feedback:
Direct user ratings: Ask users to rate outputs or overall satisfaction. This is subjective but actionable. Analyze patterns in low ratings to identify problems; a quick way to do that is sketched below.
User interviews: Talk to power users and detractors. What do they love? What frustrates them? Qualitative insights reveal blind spots that metrics miss.
Support ticket analysis: If users contact support, what are they asking? Common complaints indicate systematic failures your metrics aren’t catching.
Session replays: Watch how users actually interact with your system. Do they struggle with the interface? Misunderstand how to phrase queries? Encounter errors you didn’t anticipate?
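Qualitative feedback still benefits from a little counting. Assuming each rated session is tagged with a category (by a human triager or a classifier), the sketch below ranks categories by how over-represented they are among low ratings, so reviewers know where to look first. The field names and the 1-to-5 rating scale are assumptions.

```python
from collections import Counter

def low_rating_hotspots(sessions: list[dict], low_threshold: int = 2) -> list[tuple[str, float]]:
    """
    sessions: dicts with a 1-5 'rating' and a 'category' label (illustrative field names).
    Returns categories ranked by how over-represented they are among low-rated sessions.
    """
    all_counts = Counter(s["category"] for s in sessions)
    low_counts = Counter(s["category"] for s in sessions if s["rating"] <= low_threshold)
    total_low = sum(low_counts.values())
    if total_low == 0:
        return []
    # Lift: share of low ratings in this category vs. its share of all sessions.
    hotspots = [
        (cat, (low_counts[cat] / total_low) / (all_counts[cat] / len(sessions)))
        for cat in low_counts
    ]
    return sorted(hotspots, key=lambda pair: pair[1], reverse=True)
```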
Balancing Multiple Metrics
No single metric tells the whole story. You need a balanced scorecard:
- Quality: Accuracy, coherence, instruction-following
- Speed: Latency (p50, p95, p99)
- Cost: Token usage, compute costs
- Safety: Hallucination rate, inappropriate content rate
- Satisfaction: Task success, user ratings, retention
The weights depend on your use case. A creative writing tool prioritizes quality over cost. A customer support bot prioritizes speed and cost efficiency. A medical diagnosis tool prioritizes safety and accuracy above all else.
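One way to make the scorecard concrete is a weighted roll-up of normalized category scores, reweighted per use case. The weights and the assumption that every metric is pre-normalized to a 0-1 “higher is better” scale are illustrative, not a recommendation:

```python
# Each metric is assumed to be pre-normalized to [0, 1], where 1 is good
# (latency and cost get inverted before they reach this function).
SCORECARD_WEIGHTS = {  # illustrative weights for a customer support bot
    "quality": 0.25,
    "speed": 0.20,
    "cost": 0.15,
    "safety": 0.25,
    "satisfaction": 0.15,
}

def scorecard(metrics: dict[str, float], weights: dict[str, float] = SCORECARD_WEIGHTS) -> float:
    """Weighted average of normalized metric categories; fails loudly if one is missing."""
    missing = set(weights) - set(metrics)
    if missing:
        raise ValueError(f"missing metric categories: {sorted(missing)}")
    return sum(weights[k] * metrics[k] for k in weights) / sum(weights.values())

# A medical diagnosis tool would push the safety and quality weights up
# and the cost weight toward zero.
```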
The Feedback Loop: Metrics to Improvements
Metrics are only useful if they drive action. Build a system that:
- Surfaces problems: Automated dashboards that alert when metrics degrade
- Enables investigation: Logs and traces that let you debug why metrics changed
- Prioritizes fixes: Rank issues by impact on user satisfaction, not just frequency
- Validates improvements: A/B test changes and measure whether satisfaction metrics improve
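As a sketch of the “surfaces problems” step: compare a metric’s current window against its trailing baseline and flag meaningful drops. The 5% threshold and the print-based alert are placeholders for whatever thresholds and alerting channel your monitoring stack provides.

```python
from statistics import mean

def check_degradation(history: list[float], current_window: list[float],
                      max_relative_drop: float = 0.05) -> bool:
    """
    Return True (and emit an alert) if the current window's mean falls more than
    `max_relative_drop` below the historical baseline. Values are assumed to be
    'higher is better' metrics such as task success rate.
    """
    if not history or not current_window:
        return False
    baseline = mean(history)
    current = mean(current_window)
    if baseline > 0 and (baseline - current) / baseline > max_relative_drop:
        print(f"ALERT: metric dropped from {baseline:.3f} to {current:.3f}")  # placeholder alert sink
        return True
    return False

# Example: check_degradation(last_30_days_success, todays_success)
```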
What Good Looks Like
A mature LLM product team tracks:
- Task success rate as the primary metric
- Leading indicators (coherence, instruction-following) that predict satisfaction
- Trust-breaking failures (hallucinations, bias) with zero tolerance
- Engagement metrics (retention, session depth) as long-term health signals
- Qualitative user feedback to catch blind spots
They don’t obsess over small accuracy improvements. They obsess over whether users are getting value, and they ruthlessly prioritize changes that move that needle.
Metrics are a map, not the territory. The territory is user satisfaction. Build a map that helps you navigate toward it, but never confuse the two.