Monitoring LLM Systems in Production (Beyond Uptime)

— Traditional monitoring checks if services are up and latency is acceptable. LLM monitoring needs to track whether outputs are still good—and that's much harder.

level: intermediate
topics: llmops, monitoring, observability
tags: monitoring, observability, production, quality

Your web service has standard monitoring: uptime checks, latency percentiles, error rates, CPU and memory usage. When something breaks, alerts fire, and you investigate.

LLM systems need all of that, plus something traditional monitoring doesn’t cover: is the system still producing good outputs?

Your LLM service might be up, responding quickly, and returning 200 status codes—while simultaneously generating nonsense answers because the model was updated, a prompt was changed, or user behavior shifted. Traditional monitoring misses this entirely.

The Quality Drift Problem

LLM outputs aren’t static. Models get updated by providers. Your prompts evolve. User queries change. Upstream data sources shift. Any of these can cause output quality to degrade, even though the system technically works fine.

Model updates: OpenAI or Anthropic push a new model version. It’s supposed to be better, but for your specific use case, it performs worse. You won’t know unless you’re actively measuring output quality.

Prompt changes: A developer tweaks a prompt to fix one issue and inadvertently breaks something else. The change deploys, and quality metrics drop, but you don’t notice until users complain.

Data drift: Your RAG system retrieves from a knowledge base that’s been updated. New documents change retrieval patterns, and answers become less relevant.

User behavior changes: Users start asking different types of questions. Your system was tuned for one distribution of queries, and now it’s getting another. Performance degrades.

Traditional uptime monitoring catches none of this. You need continuous quality monitoring.

Sampling for Quality Assessment

You can’t evaluate every single output—it’s too expensive. But you can evaluate a statistically significant sample.

Random sampling: Evaluate 1-5% of production outputs with automated checks (LLM-as-judge or rules-based validation). This gives you a baseline quality metric.

Stratified sampling: Ensure your sample covers different query types, user segments, and use cases proportionally. This prevents blind spots.

Targeted sampling: Prioritize evaluation of high-risk outputs (long responses, outputs containing sensitive topics, responses flagged by users).

Track quality metrics over time. If accuracy drops from 92% to 85%, that’s a regression worth investigating.
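As a concrete illustration, here is a minimal Python sketch of random plus targeted sampling with an automated check. The `judge_output` placeholder, the 2% sample rate, and the `user_flagged` signal are assumptions, not a specific tool's API.

```python
import random

SAMPLE_RATE = 0.02  # evaluate roughly 2% of production traffic (assumed rate)

def judge_output(query: str, output: str) -> float:
    """Placeholder check: swap in an LLM-as-judge call or rules-based validator."""
    return 1.0 if output.strip() and len(output) < 4000 else 0.0

def maybe_evaluate(query: str, output: str, user_flagged: bool, results: list) -> None:
    """Randomly sample outputs for evaluation; always evaluate user-flagged ones."""
    if user_flagged or random.random() < SAMPLE_RATE:
        results.append({"query": query, "output": output,
                        "score": judge_output(query, output)})
```

Averaging the sampled scores per day or per week gives the trend line you compare against your baseline.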

Metrics to Track Continuously

Refusal rate: How often does the LLM refuse to answer or hedge with “I don’t know”? A sudden increase suggests prompts or models changed in ways that reduce confidence.

Latency distribution: Track p50, p95, and p99 latencies. If p95 suddenly increases, you might be hitting rate limits, or the model provider is experiencing slowdowns.

Token usage: Monitor input and output token counts. If average tokens per request spike, something changed—longer system prompts, more verbose outputs, or users asking more complex questions.

Error rates: Track both hard errors (exceptions, 500s) and soft errors (malformed outputs that don’t match expected schemas).

User feedback signals: If you have thumbs up/down buttons or user ratings, track these as a real-time quality proxy. A drop in positive ratings is an early warning.

Hallucination rate: Use automated detection (fact-checking against sources, consistency checks) to estimate how often the model generates false information.
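A rough sketch of how several of these metrics might be aggregated over one monitoring window of logged requests. The record fields (`latency_s`, `output_text`, `output_tokens`, `error`) and the refusal markers are assumptions about your logging schema.

```python
REFUSAL_MARKERS = ("i don't know", "i can't help", "i cannot answer")  # assumed phrases

def window_metrics(records: list[dict]) -> dict:
    """Aggregate health and quality metrics over one monitoring window."""
    latencies = sorted(r["latency_s"] for r in records)
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    refusals = sum(
        any(m in r["output_text"].lower() for m in REFUSAL_MARKERS) for r in records
    )
    return {
        "p50_latency_s": latencies[len(latencies) // 2],
        "p95_latency_s": latencies[p95_index],
        "refusal_rate": refusals / len(records),
        "error_rate": sum(r["error"] for r in records) / len(records),
        "avg_output_tokens": sum(r["output_tokens"] for r in records) / len(records),
    }
```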

Alerting on Quality Degradation

Traditional alerting: if error rate exceeds 5%, page someone.

LLM alerting: if quality metrics drop below thresholds, investigate.

Set baseline metrics: Measure quality on a representative test set and in production sampling. Establish what “normal” looks like.

Define thresholds: Decide what level of degradation warrants investigation. A 5% drop in accuracy might be noise; a 15% drop is a real issue.

Alert on trends: Don’t just alert on absolute values—alert when metrics trend downward over time. A gradual decline is easier to miss than a sudden drop but equally damaging.

Separate alerts by severity: Minor quality dips might generate low-priority tickets. Major dips or safety failures should page immediately.
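One way to wire this up, sketched below: a threshold check against a baseline plus a simple trend check, with the severe case paged and the gradual case ticketed. The 10% absolute drop and the five-window trend are illustrative numbers, not recommendations.

```python
def quality_alerts(accuracy_history: list[float], baseline: float) -> list[str]:
    """Return alerts for the latest window, given accuracy scores ordered oldest first."""
    alerts = []
    current = accuracy_history[-1]
    if current < baseline - 0.10:                     # severe drop: page someone
        alerts.append(f"PAGE: accuracy {current:.2f} vs baseline {baseline:.2f}")
    recent = accuracy_history[-5:]                    # gradual decline: open a ticket
    if len(recent) == 5 and all(a > b for a, b in zip(recent, recent[1:])):
        alerts.append("TICKET: accuracy has declined for 5 consecutive windows")
    return alerts
```

For example, `quality_alerts([0.92, 0.91, 0.90, 0.89, 0.80], baseline=0.92)` returns both the page and the ticket.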

Tracing Individual Requests

When something goes wrong, you need to debug specific requests.

Request IDs: Assign a unique ID to every request and propagate it through all systems (LLM calls, retrieval, tool execution). This lets you reconstruct what happened.

Full logging: Log the user query, system prompt, retrieved context, tool calls, intermediate reasoning (if using chain-of-thought), and final output. Without this, debugging is guesswork.

Metadata tracking: Log which model version was used, which prompt version, which retrieval strategy. This helps correlate issues with specific configurations.

Replay capability: Ideally, you can replay a request with the exact same inputs to see if the issue is reproducible or a one-off fluke.
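A minimal sketch of what such a structured record might look like, assuming JSON-lines logging; the field names and the `context` dict are assumptions about your pipeline, not a standard schema.

```python
import json
import time
import uuid

def log_request(query: str, output: str, context: dict,
                log_path: str = "llm_requests.jsonl") -> str:
    """Write one structured record per request so it can be traced and replayed."""
    request_id = context.get("request_id") or str(uuid.uuid4())
    record = {
        "request_id": request_id,                       # propagate through retrieval, tools, etc.
        "timestamp": time.time(),
        "model_version": context.get("model_version"),  # assumed metadata fields
        "prompt_version": context.get("prompt_version"),
        "system_prompt": context.get("system_prompt"),
        "retrieved_context": context.get("retrieved_context"),
        "tool_calls": context.get("tool_calls"),
        "user_query": query,
        "final_output": output,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return request_id
```

With everything needed to rebuild the request stored in one record, replay amounts to re-running the pipeline with the logged inputs and comparing outputs.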

Distributed Tracing for Complex Pipelines

If your LLM system involves multiple steps (retrieval, reranking, generation, post-processing), you need distributed tracing.

Spans for each component: Each step (retrieval, LLM call, validation) is a span with timing and metadata. This shows where latency comes from and where failures occur.

Parent-child relationships: If one LLM call triggers tool execution that triggers another LLM call, the trace should show this hierarchy.

Correlation with external services: If you’re calling external APIs (search engines, databases, third-party models), include those in your traces too.

Tools like OpenTelemetry, Datadog, or LangSmith help with this, but you need to instrument your code properly.
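A small sketch using the OpenTelemetry Python API to show the span structure. Exporter and provider configuration are omitted (without an SDK configured, these spans are no-ops), and `retrieve`/`generate` are stand-ins for your real pipeline steps.

```python
from opentelemetry import trace  # requires the opentelemetry-api package

tracer = trace.get_tracer("llm-pipeline")

def retrieve(query: str) -> list[str]:
    return ["document about " + query]               # stand-in for the real retriever

def generate(query: str, docs: list[str]) -> str:
    return f"answer based on {len(docs)} documents"  # stand-in for the real LLM call

def answer(query: str) -> str:
    # Parent span for the whole request; child spans show where latency comes from.
    with tracer.start_as_current_span("handle_request") as root:
        root.set_attribute("query.chars", len(query))
        with tracer.start_as_current_span("retrieval") as span:
            docs = retrieve(query)
            span.set_attribute("retrieval.num_docs", len(docs))
        with tracer.start_as_current_span("llm_call") as span:
            output = generate(query, docs)
            span.set_attribute("llm.output_chars", len(output))
        return output
```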

Cost Monitoring

LLM costs are variable and can spike unexpectedly.

Track spend by endpoint: Know which parts of your application are driving costs. A rarely used feature that makes expensive calls can blow your budget.

Monitor token usage trends: If average tokens per request increase, costs increase proportionally. Catch this early before it becomes a budget problem.

Set budget alerts: Configure alerts when daily or monthly spend exceeds thresholds. This prevents surprise bills.

Analyze cost vs. quality tradeoffs: If you’re using GPT-4 for everything, maybe some use cases could use GPT-3.5 without quality loss. Monitor whether cheaper models would work.
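A back-of-the-envelope cost sketch: per-request cost from token counts plus a daily budget check. The per-1K-token prices and the $200/day budget are placeholders; substitute your provider's actual rates.

```python
# Placeholder per-1K-token prices (USD); not current vendor pricing.
PRICE_PER_1K = {
    "gpt-4":   {"input": 0.03,   "output": 0.06},
    "gpt-3.5": {"input": 0.0005, "output": 0.0015},
}
DAILY_BUDGET_USD = 200.0  # assumed threshold

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICE_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]

def budget_alerts(spend_by_endpoint: dict[str, float]) -> list[str]:
    """Flag when total daily spend crosses the budget and name the top cost drivers."""
    total = sum(spend_by_endpoint.values())
    if total <= DAILY_BUDGET_USD:
        return []
    top = sorted(spend_by_endpoint.items(), key=lambda kv: kv[1], reverse=True)[:3]
    return [f"Budget alert: ${total:.2f} spent today; top endpoints: {top}"]
```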

User Behavior Analytics

Understanding how users interact with your system helps predict problems.

Query patterns: Are users asking longer questions? More complex questions? Questions your system wasn’t designed for?

Abandonment rate: Do users give up after the first response? That suggests the response wasn’t helpful.

Retry patterns: If users rephrase the same question multiple times, the initial response failed.

Feature usage: Which parts of your system get used most? Least? Are new features being adopted?
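Several of these signals can be approximated directly from session logs. A rough sketch, assuming each session is an ordered list of messages with a `user_text` field and that a near-duplicate follow-up query counts as a retry:

```python
def session_signals(sessions: list[list[dict]]) -> dict:
    """Estimate abandonment and retry rates from per-session message lists."""
    abandoned = sum(1 for s in sessions if len(s) == 1)   # user left after one exchange
    retried = 0
    for session in sessions:
        queries = [m["user_text"].lower().split() for m in session]
        for first, second in zip(queries, queries[1:]):
            overlap = len(set(first) & set(second))
            if overlap >= max(3, len(set(first)) // 2):   # crude rephrasing heuristic
                retried += 1
                break
    return {
        "abandonment_rate": abandoned / len(sessions),
        "retry_rate": retried / len(sessions),
    }
```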

Canary Deployments and Gradual Rollouts

Don’t push changes to 100% of traffic immediately. Roll out gradually and monitor.

Canary deployments: Deploy changes to 5% of traffic first. Monitor quality metrics for an hour. If metrics hold, increase to 25%, then 50%, then 100%.

A/B testing: Run old and new versions in parallel, split traffic, and compare quality metrics. Only promote the new version if it’s measurably better.

Rollback readiness: If quality metrics drop, you need to roll back quickly. Have automated rollback triggers and tested rollback procedures.
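A skeleton of that rollout loop with an automated rollback trigger. `set_traffic_split` and `measure_quality` are hypothetical hooks into your routing layer and quality sampling; the stages mirror the percentages above, and the 0.88 threshold is illustrative.

```python
import time

ROLLOUT_STAGES = [0.05, 0.25, 0.50, 1.00]    # fraction of traffic on the new version
MIN_QUALITY = 0.88                           # assumed rollback threshold

def gradual_rollout(set_traffic_split, measure_quality, soak_seconds: int = 3600) -> bool:
    """Step through canary stages; roll back to 0% if sampled quality drops too far."""
    for fraction in ROLLOUT_STAGES:
        set_traffic_split(fraction)
        time.sleep(soak_seconds)             # let quality metrics accumulate at this stage
        if measure_quality() < MIN_QUALITY:
            set_traffic_split(0.0)           # automated rollback
            return False
    return True
```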

Anomaly Detection

Some issues are too subtle for fixed thresholds but still indicate problems.

Statistical anomaly detection: If latency is usually 1s ± 0.2s, and suddenly it’s 1.5s consistently, that’s an anomaly worth investigating even if it doesn’t cross a hard threshold.

Seasonal patterns: User behavior might vary by time of day or day of week. Anomaly detection should account for this, not alert on expected variations.

Multivariate analysis: Sometimes each metric looks fine on its own, but the combination reveals a problem. Latency is within its threshold and error rate is within its threshold, yet both ticked up at the same time; that correlated shift is suspicious.
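For the single-metric case, a simple z-score check is often enough to start with. A sketch, with an illustrative threshold; seasonality can be handled by building the history from comparable periods (same hour of day, same day of week).

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], current: float, z_threshold: float = 2.0) -> bool:
    """Flag a value that sits far outside the recent distribution of the same metric."""
    if len(history) < 10:
        return False                         # too little data to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold
```

With the example above (latency usually 1s ± 0.2s), a consistent 1.5s gives a z-score of 2.5 and would be flagged even though no hard latency threshold was crossed.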

Dashboard Design

You need dashboards that surface issues quickly.

High-level health: A single screen showing overall system health—traffic volume, error rate, latency, quality metrics, cost.

Drill-down views: Click into specific time ranges, user segments, or endpoints to investigate anomalies.

Real-time alerts: Show active alerts prominently, with context about why they fired.

Historical trends: Show metrics over days, weeks, months to detect slow degradation.

The Human Review Loop

Automated monitoring catches most issues, but not all. Budget for human review.

Spot checks: Have team members randomly review production outputs weekly. Do they look good? Anything surprising?

User-flagged issues: When users report problems, review those outputs and add them to your test sets.

Periodic quality audits: Every month or quarter, do a deep dive on quality metrics across different user segments and use cases.

What Good Looks Like

A well-monitored LLM system has:

  • Automated quality sampling on 1-5% of production traffic
  • Alerts on quality degradation, not just downtime
  • Full request tracing with replay capability
  • Cost tracking and budget alerts
  • Canary deployment processes with automated rollback
  • Dashboards that show both technical and quality metrics
  • Regular human review to catch issues automation misses

Monitoring LLM systems is harder than monitoring traditional software because quality is subjective, outputs are variable, and many issues are subtle. But with the right instrumentation, you can catch problems before they scale and maintain high quality in production.