Debugging LLM Failures in Production
— When traditional code fails, stack traces tell you what went wrong. When LLMs fail, you get plausible-sounding nonsense or silence. Here's how to debug the undebuggable.
A user reports: “The AI gave me a completely wrong answer.” You check logs. The API call succeeded. Status 200. No errors. The system is “working.” But the output was garbage.
This is the nightmare of LLM debugging. Traditional debugging relies on error messages, stack traces, and deterministic reproduction. LLMs give you none of that. The same input might produce different outputs. Failures are often silent—the system returns confidently wrong information without any error signal.
Debugging LLM systems requires different tools and different thinking.
The Logging You Actually Need
Traditional logging captures exceptions and errors. LLM logging needs to capture everything, because the “error” might be in the output, not in execution.
Full request and response: Log the complete user query, system prompt, any retrieved context, and the full LLM response. Without this, you’re guessing about what happened.
Intermediate reasoning: If you’re using chain-of-thought or agent systems, log every thought step and tool call. The failure might be in step 3 of 10, and you need to see the whole chain.
Token counts and metadata: Log input token count, output token count, model version, temperature, and other parameters. Sometimes failures correlate with specific settings.
Timestamps and latency: Track when each step occurred and how long it took. Slow responses might indicate timeouts or retries that affected quality.
Request IDs: Assign unique IDs to every request and propagate them through all systems. This lets you trace a single user request across logs, databases, and external APIs.
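A minimal sketch of what such a record might look like, using Python's standard logging. The field names and the `log_llm_call` helper are illustrative, not any particular library's API:

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("llm")

def log_llm_call(user_query, system_prompt, retrieved_context,
                 response, usage, model, temperature, started_at,
                 request_id=None):
    """Write one structured record per LLM call so failures can be replayed later."""
    record = {
        "request_id": request_id or str(uuid.uuid4()),
        "timestamp": started_at,
        "latency_ms": round((time.time() - started_at) * 1000),
        "model": model,
        "temperature": temperature,
        "input_tokens": usage.get("prompt_tokens"),
        "output_tokens": usage.get("completion_tokens"),
        "system_prompt": system_prompt,
        "user_query": user_query,
        "retrieved_context": retrieved_context,
        "response": response,
    }
    logger.info(json.dumps(record))
    return record["request_id"]
```

One JSON line per call is enough to replay a failure later or aggregate records for the statistical analysis described below.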
Reproducing Non-Deterministic Failures
The user says “it gave me a bad answer.” You can’t reproduce it because LLMs are probabilistic.
Capture exact prompts: If you log the exact prompt sent to the LLM, you can replay it. Even with temperature > 0, replaying helps you understand the failure mode.
Seed control: Some APIs let you set a random seed for deterministic outputs. Use this in debugging to reproduce exact responses.
Multiple replays: If the failure is probabilistic (happens 30% of the time), replay the same prompt 10 times and see the distribution of outputs. This reveals whether the bad output is an outlier or common.
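Putting these together, a replay harness can be very small. The sketch below assumes the OpenAI Python SDK, which accepts a `seed` parameter on chat completions; other providers expose determinism differently, and the model name is just an example:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def replay(messages, model="gpt-4o-mini", n=10, temperature=1.0, seed=None):
    """Replay a captured prompt several times and summarize the output distribution."""
    outputs = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
            seed=seed,  # more deterministic replay where the provider supports it
        )
        outputs.append(resp.choices[0].message.content.strip())
    return Counter(outputs)

# Example: how often does the bad answer actually appear?
# counts = replay(captured_messages, n=10)
# print(counts.most_common(3))
```

If the bad answer shows up in 3 of 10 replays, you're chasing a probabilistic failure mode rather than a one-off fluke.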
Common Failure Patterns
LLM failures fall into recognizable categories. Knowing them helps diagnosis.
Hallucinations: The model makes up facts. Debug by checking whether the information was in the context. If not, the model hallucinated. Fix: improve retrieval, add explicit citations, or instruct the model to admit when it doesn’t know.
Instruction-following failures: The model ignores constraints (like “respond in 3 bullet points”). Debug by checking the prompt structure. Maybe the instruction is buried in context. Fix: move critical instructions to the beginning or end, use explicit formatting cues.
Context overflow: The input exceeds the model’s context window, so later parts get truncated. Debug by checking token counts (a quick check is sketched after this list). Fix: summarize context, prioritize the most relevant chunks, or use a model with a larger context window.
Prompt injection: The user input includes adversarial text that overrides your instructions (“Ignore previous instructions and…”). Debug by examining user input for suspicious patterns. Fix: input sanitization, stronger system prompts, or explicit boundaries between instructions and user data.
Inconsistency: The model contradicts itself within a single response. Debug by analyzing the full output for logical contradictions. Fix: prompt for step-by-step reasoning, use multiple passes with validation, or reduce temperature for more consistent outputs.
Refusal or hedging: The model won’t answer or hedges excessively (“I’m not sure, but maybe…”). Debug by checking if the query triggered safety filters or if the model lacks confidence. Fix: rephrase prompts to be clearer, provide more context, or adjust model parameters.
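For the context-overflow case, a quick token count with a tokenizer such as tiktoken tells you whether the prompt even fits. The context window and output reservation below are assumptions; substitute your model's real limits:

```python
import tiktoken

def check_context_fit(system_prompt, context_chunks, user_query,
                      model="gpt-4o-mini", context_window=128_000,
                      reserved_for_output=1_000):
    """Count prompt tokens and flag overflow before the model silently truncates."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")  # fallback for unknown model names
    full_prompt = "\n\n".join([system_prompt, *context_chunks, user_query])
    n_tokens = len(enc.encode(full_prompt))
    budget = context_window - reserved_for_output
    if n_tokens > budget:
        print(f"Context overflow: {n_tokens} tokens against a budget of {budget}")
    return n_tokens
```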
Debugging Tools and Techniques
Diff analysis: Compare the failing prompt to a successful one. What’s different? Maybe a single word change broke instruction-following.
Ablation testing: Remove parts of the prompt incrementally. If removing section X fixes the issue, you’ve found the culprit (a small harness is sketched after this list).
Prompt forensics: Analyze how the model interpreted the prompt. Sometimes adding “Before answering, restate what you’re being asked to do” reveals misunderstandings.
Manual replay with variations: Change one variable at a time (temperature, model, prompt wording) and observe effects. This isolates the cause.
Statistical analysis: If failures are intermittent, analyze logs for patterns. Do failures correlate with time of day? Input length? Specific keywords?
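The ablation idea lends itself to a simple harness. Everything here is a placeholder you supply: `build_prompt`, `call_llm`, and `is_good_output` stand in for your own prompt assembly, model call, and failure check.

```python
def ablate_prompt(sections, build_prompt, call_llm, is_good_output):
    """Drop one prompt section at a time and see which removal makes the output acceptable.

    sections: dict of {section_name: section_text}
    build_prompt: joins the remaining sections into a single prompt string
    call_llm: sends a prompt and returns the model's text output
    is_good_output: your check for whether the failure is gone
    """
    results = {}
    for name in sections:
        remaining = {k: v for k, v in sections.items() if k != name}
        output = call_llm(build_prompt(remaining))
        results[name] = is_good_output(output)
    # Sections whose removal fixes the output are the prime suspects
    return [name for name, fixed in results.items() if fixed]
```

Because outputs are probabilistic, it's worth running each ablation a few times before blaming a section on a single sample.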
The “Shadow LLM” Technique
Run a stronger, more expensive model in parallel on a sample of requests. Compare its outputs to your production model.
How it works: Send the same prompt to GPT-4 (expensive, high-quality) and GPT-3.5 (cheap, production model). Log both outputs.
What you learn: If GPT-4 consistently gets it right and GPT-3.5 fails, the issue is model capability. If both fail, the issue is the prompt or context.
When to use it: Investigating quality degradations or comparing models before switching.
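A sketch of the shadow run, assuming the OpenAI Python SDK; the model names and the 10% sampling rate are just examples, and the logging is whatever sink you already use:

```python
import json
import logging
import random

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
shadow_log = logging.getLogger("shadow")

def answer_with_shadow(messages, production_model="gpt-3.5-turbo",
                       shadow_model="gpt-4", sample_rate=0.1):
    """Serve the production model; on a sample of requests, also run a stronger model and log both."""
    prod = client.chat.completions.create(model=production_model, messages=messages)
    prod_text = prod.choices[0].message.content

    if random.random() < sample_rate:
        shadow = client.chat.completions.create(model=shadow_model, messages=messages)
        shadow_log.info(json.dumps({
            "messages": messages,
            "production_output": prod_text,
            "shadow_output": shadow.choices[0].message.content,
        }))

    return prod_text
```

Sampling keeps the cost of the stronger model bounded while still giving you a steady stream of comparisons to review.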
User Feedback as Debug Signal
Not all failures are visible in logs. Users notice issues you don’t.
Thumbs down tracking: If users rate outputs poorly, flag those requests for review. What did the LLM do wrong?
Support ticket analysis: Common support complaints indicate systematic failures. “The AI doesn’t understand my question” might mean prompt issues or missing retrieval.
Session replay: If your app allows it, watch how users interact. Do they rephrase queries multiple times? That suggests the first response failed.
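The key is that feedback events carry the same request ID as the original call, so a thumbs-down can be joined back to the full prompt and response. A minimal sketch, with a made-up schema:

```python
import json
import logging
import time

feedback_log = logging.getLogger("feedback")

def record_feedback(request_id, rating, comment=None):
    """Tie user feedback back to the logged request so the full prompt and response can be reviewed."""
    feedback_log.info(json.dumps({
        "request_id": request_id,   # same ID attached to the original LLM call log
        "rating": rating,           # e.g. "thumbs_up" / "thumbs_down"
        "comment": comment,
        "timestamp": time.time(),
    }))
```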
Correlating Failures with Changes
When quality suddenly degrades, what changed?
Model updates: Did the provider update the model? Even “minor” updates can change behavior. Track model versions in logs.
Prompt changes: Did you recently update prompts? Correlate quality metrics with prompt versions.
Data changes: For RAG systems, did the knowledge base update? New documents might introduce retrieval issues.
Traffic shifts: Did user behavior change? A surge in a new type of query might reveal weaknesses your system wasn’t designed for.
Use timestamps and version metadata to correlate failures with changes.
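If each log record carries version metadata, the correlation is a few lines of pandas. The `model`, `prompt_version`, and `thumbs_down` fields below are assumptions about what you log, not a standard schema:

```python
import pandas as pd

# logs.jsonl: one JSON record per request, including version metadata and a quality signal
df = pd.read_json("logs.jsonl", lines=True)

# Thumbs-down rate per (model, prompt_version) combination; the worst offenders float to the top
by_version = (
    df.groupby(["model", "prompt_version"])["thumbs_down"]
      .mean()
      .sort_values(ascending=False)
)
print(by_version)
```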
The “Golden Path” Technique
Identify a request that works well. Treat it as a reference.
How it works: Find a successful request that’s representative of your use case. Log it as a “golden path” example. When debugging failures, compare to the golden path: what’s different?
What you learn: Differences in input length, query structure, or context formatting might explain failures.
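A unified diff between the golden prompt and the failing one is often enough to spot the culprit; Python's standard difflib does the job:

```python
import difflib

def diff_against_golden(golden_prompt, failing_prompt):
    """Show exactly what differs between a known-good request and the failing one."""
    diff = difflib.unified_diff(
        golden_prompt.splitlines(),
        failing_prompt.splitlines(),
        fromfile="golden",
        tofile="failing",
        lineterm="",
    )
    print("\n".join(diff))

# Also worth comparing coarse stats, not just text:
# prompt length, number of context chunks, presence of formatting instructions, etc.
```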
Distributed Tracing for Complex Flows
In multi-step LLM pipelines (RAG, agents, multi-model systems), failures can occur anywhere.
Trace propagation: Assign a trace ID to each user request and propagate it through all steps (retrieval, LLM call, post-processing).
Span analysis: Each step is a “span” with timing and metadata. If the LLM output is bad, check whether retrieval failed first. Trace upstream to find the root cause.
Visualization: Tools like Datadog or LangSmith, often fed by OpenTelemetry instrumentation, can visualize traces, showing you the full flow and where things went wrong.
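With the OpenTelemetry Python API, instrumenting a RAG pipeline looks roughly like this. It assumes an OpenTelemetry SDK and exporter are configured elsewhere, and `retrieve` and `call_llm` are placeholders for your own steps:

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-pipeline")

def answer_question(user_query, retrieve, call_llm):
    """Wrap each pipeline step in a span so a bad answer can be traced back upstream."""
    with tracer.start_as_current_span("request") as root:
        root.set_attribute("user_query", user_query)

        with tracer.start_as_current_span("retrieval") as span:
            chunks = retrieve(user_query)
            span.set_attribute("num_chunks", len(chunks))

        with tracer.start_as_current_span("llm_call") as span:
            answer = call_llm(user_query, chunks)
            span.set_attribute("output_chars", len(answer))

        return answer
```

If no SDK is configured, the API falls back to a no-op tracer, so the instrumentation is safe to leave in place.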
Testing Hypotheses
Debugging LLMs is hypothesis-driven.
Form a hypothesis: “The model fails because the prompt is too long.”
Test it: Shorten the prompt and retry. Did quality improve?
Iterate: If yes, you found the issue. If no, form a new hypothesis.
This is slower than traditional debugging (where stack traces point to the issue), but it’s necessary when dealing with probabilistic systems.
What to Do When You Can’t Fix It
Sometimes the issue is fundamental: the model isn’t capable, the task is too hard, or user intent is ambiguous.
Detect and escalate: Use confidence scoring or validation checks to detect low-quality outputs and escalate to humans.
Provide fallback: If the LLM can’t answer, fall back to simpler methods (keyword search, rule-based systems) or admit “I don’t know.”
Communicate limitations: Be transparent with users. “This feature works best for X type of questions” helps set expectations.
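A sketch of the detect-and-fall-back flow, where `call_llm`, `validate`, and `keyword_search` are placeholders for your own components:

```python
def answer_with_fallback(user_query, call_llm, validate, keyword_search):
    """Validate the LLM's answer; fall back to simpler methods or an honest 'I don't know'."""
    answer = call_llm(user_query)
    if validate(answer):  # e.g. citation check, format check, confidence heuristic
        return answer

    results = keyword_search(user_query)  # simpler, more predictable fallback
    if results:
        return ("I couldn't generate a reliable answer, but these documents may help:\n"
                + "\n".join(results))

    return "I don't know the answer to that. Would you like me to route this to a human?"
```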
Building Debug-Friendly Systems
Design your system to be debuggable from the start:
- Log everything (requests, responses, intermediate steps)
- Use unique request IDs for tracing
- Version prompts and models explicitly
- Implement shadow testing and A/B comparisons
- Collect user feedback systematically
- Build tooling to replay and analyze failures
The Mental Model
Traditional debugging: find the line of code that’s wrong, fix it.
LLM debugging: find the combination of prompt, context, model, and parameters that produces bad outputs, then iteratively adjust until outputs improve.
It’s slower, less deterministic, and more experimental. But with the right logging, tools, and mental models, you can systematically diagnose and fix LLM failures—even when they don’t throw errors.