Error Handling in Agent Systems (It's Not Like Regular Code)
Traditional error handling assumes you can predict failure modes and write catch blocks. Agents fail in ways you can't anticipate, and they need to recover autonomously.
In traditional software, you write try-catch blocks for known error cases. Database connection fails? Catch the exception and retry with exponential backoff. API returns 429? Wait and retry. Input validation fails? Return a helpful error message.
This works because you can enumerate failure modes. But agents call tools you didn’t expect, with parameters you didn’t anticipate, in sequences you didn’t design. They fail in novel ways, and traditional error handling doesn’t cover them.
You need error handling that works when you can’t predict what will go wrong.
The Errors You Can Predict
Start with the obvious ones:
Tool execution failures: The API call times out. The database is unreachable. The rate limit is exceeded. These are standard software errors, and you handle them with standard techniques: retries, circuit breakers, fallbacks.
Invalid parameters: The LLM passes a string where an integer is required. It uses a parameter name that doesn’t exist. It calls a tool that doesn’t exist. These are validation errors, and you catch them before execution.
Permission errors: The agent tries to access a resource it’s not authorized for. You catch this at the authorization layer and return a clear error message.
Resource exhaustion: The agent hits token limits, time limits, or iteration limits. You detect these and terminate gracefully.
All of these are predictable. You can write explicit handlers for each.
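The predictable failures can be handled with ordinary code. Here is a minimal sketch of a retry wrapper with exponential backoff; `TransientToolError` is a hypothetical exception type standing in for retryable failures like timeouts and rate limits.

```python
import time

class TransientToolError(Exception):
    """Raised for retryable failures: timeouts, 429s, unreachable services."""

def call_with_retry(tool_fn, args, max_attempts=3, base_delay=1.0):
    """Retry a tool call with exponential backoff on transient errors."""
    for attempt in range(max_attempts):
        try:
            return tool_fn(**args)
        except TransientToolError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)
```

Validation and permission errors get the same treatment: check before executing, and return a specific error rather than letting the tool blow up mid-call.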
The Errors You Can’t Predict
Semantic misuse: The agent calls a valid tool with valid parameters, but the call doesn’t make sense in context. Like calling “send_email” before “authenticate_user”, or “calculate_total” before “fetch_items”. The individual calls are fine; the sequence is wrong.
Cascading failures: The agent calls Tool A, which succeeds. It interprets the result incorrectly and calls Tool B with nonsense parameters. Tool B fails, the agent retries with different nonsense, and the system spirals into a failure loop.
Implicit assumptions violated: The agent assumes certain conditions hold (like “user is logged in” or “shopping cart is not empty”) and makes tool calls based on that assumption. If the assumption is wrong, tools return unexpected results, and the agent doesn’t know how to recover.
Ambiguous error states: A tool returns a result that technically succeeds but indicates a problem. Like a search tool returning zero results. Is that an error or a valid outcome? The agent might interpret it as success and proceed, or it might get confused and retry unnecessarily.
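One way to defuse the zero-results ambiguity is to make "no matches" an explicit, labeled outcome rather than an empty list the agent has to interpret. A sketch, with illustrative field names:

```python
def search_result_payload(hits):
    """Wrap search results so 'zero results' is unambiguous to the agent."""
    if not hits:
        return {
            "status": "ok",
            "count": 0,
            # Explicitly tell the LLM this is a valid outcome, not a failure.
            "note": "No matches found. This is a valid outcome; do not retry the same query.",
        }
    return {"status": "ok", "count": len(hits), "hits": hits}
```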
Surfacing Errors to the Agent
When a tool fails, you return an error to the LLM. How you format that error determines whether the agent can recover.
Opaque errors don’t help: “Error 500: Internal Server Error” tells the agent nothing. It doesn’t know if retrying will help, if it used the wrong parameters, or if it should try a different approach.
Actionable errors enable recovery: “Rate limit exceeded. Please wait 60 seconds and retry.” Now the agent knows what went wrong and what to do about it.
Contextual errors prevent confusion: “User ID 12345 not found. Verify the ID or search by email instead using the search_user_by_email tool.” This not only explains the error but suggests an alternative approach.
Think of error messages as communication with the LLM. The clearer the message, the better the agent can adapt.
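A small helper can enforce this discipline so every tool error reaching the LLM explains what failed and what to try next. The schema below is illustrative, not a standard:

```python
def format_tool_error(tool_name, error_type, detail, suggestion=None):
    """Build an actionable, contextual error payload to return to the LLM."""
    message = f"{tool_name} failed: {detail}"
    if suggestion:
        # The suggestion is what turns an opaque error into a recoverable one.
        message += f" Suggestion: {suggestion}"
    return {"status": "error", "error_type": error_type, "message": message}
```

For example, `format_tool_error("get_user", "not_found", "User ID 12345 not found.", "Try search_user_by_email instead.")` produces a message the agent can act on rather than a bare status code.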
Retry Strategies for Agents
Traditional retry logic: if an operation fails, retry N times with exponential backoff, then give up.
Agent retry logic: more complex, because the agent can adapt its approach.
Dumb retries: The agent calls the same tool with the same parameters multiple times. This wastes resources and clogs logs. Detect and prevent this.
Smart retries: The agent calls the same tool but adjusts parameters based on the error. If “search_by_id” fails, it tries “search_by_email”. This is desirable behavior.
Alternative strategies: The agent abandons the failing approach entirely and tries something different. If searching fails, it asks the user for clarification instead. This is often the best outcome.
Your error handling should encourage smart retries and alternative strategies, while preventing dumb retries.
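Dumb retries are easy to detect mechanically: the same tool called with the same parameters, over and over. A minimal guard, with illustrative thresholds:

```python
from collections import deque

class RepeatCallGuard:
    """Detect 'dumb retries': identical tool calls repeated within a window."""

    def __init__(self, window=5, max_repeats=2):
        self.recent = deque(maxlen=window)  # last `window` call signatures
        self.max_repeats = max_repeats

    def allow(self, tool_name, params):
        # Normalize params so {"a": 1, "b": 2} and {"b": 2, "a": 1} match.
        key = (tool_name, tuple(sorted(params.items())))
        repeats = sum(1 for k in self.recent if k == key)
        self.recent.append(key)
        return repeats < self.max_repeats
```

When the guard blocks a call, return an error telling the agent the exact repeat was rejected and that it should change parameters or strategy, which nudges it toward smart retries.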
Circuit Breakers for Agent Workflows
Traditional circuit breakers stop calling a failing service after N consecutive failures.
Agent circuit breakers are similar but need tuning:
Per-tool circuit breakers: If a specific tool fails repeatedly, disable it temporarily. The agent can’t call it even if it wants to. This prevents cascading failures.
Global failure thresholds: If the agent encounters errors on >50% of tool calls in a session, something’s fundamentally wrong. Halt the workflow and escalate to human review.
Adaptive timeouts: If a tool usually responds in <1s but starts taking >5s, it’s degraded. Reduce timeout thresholds to fail fast rather than waiting.
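A per-tool breaker can be sketched in a few lines: count consecutive failures, trip after a threshold, and allow a trial call again once a cooldown elapses. Thresholds here are illustrative.

```python
import time

class ToolCircuitBreaker:
    """Disable a tool after consecutive failures; re-enable after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_s=60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = {}   # tool -> consecutive failure count
        self.opened_at = {}  # tool -> time the breaker tripped

    def available(self, tool):
        opened = self.opened_at.get(tool)
        if opened is None:
            return True
        if time.monotonic() - opened >= self.cooldown_s:
            # Cooldown elapsed: close the breaker and allow a trial call.
            del self.opened_at[tool]
            self.failures[tool] = 0
            return True
        return False

    def record(self, tool, success):
        if success:
            self.failures[tool] = 0
            return
        self.failures[tool] = self.failures.get(tool, 0) + 1
        if self.failures[tool] >= self.failure_threshold:
            self.opened_at[tool] = time.monotonic()
```

When a tool is tripped, simply omit it from the tool list you hand the LLM; a global failure-rate check can sit one level above this and halt the whole session.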
Fallback Hierarchies
When an approach fails, the agent needs alternatives.
Primary strategy fails, try secondary: If deep research using multiple tools doesn’t work, fall back to a simple search-and-summarize approach.
Tool unavailable, use approximation: If the weather API is down, use cached data or inform the user that current data isn’t available.
Can’t complete task, return partial result: If the agent can’t fully answer a question, return what it knows with caveats. “I found information about X and Y, but couldn’t verify Z.”
These fallbacks should be designed into your system, not left to the agent to invent. Provide clear guidance about when to fall back and what the fallback behavior should be.
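Designed-in fallbacks can be expressed as an ordered list of strategies tried until one succeeds, with accumulated errors preserved as caveats for a partial result. Strategy names below are illustrative:

```python
def run_with_fallbacks(strategies, query):
    """Try (name, callable) strategies in priority order.

    Returns the first success along with the errors seen on the way;
    if everything fails, returns a partial result with those caveats.
    """
    errors = []
    for name, fn in strategies:
        try:
            return {"result": fn(query), "strategy": name, "errors": errors}
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    # All strategies failed: surface the caveats instead of a bare failure.
    return {"result": None, "strategy": None, "errors": errors}
```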
Human-in-the-Loop for Unrecoverable Errors
Some errors can’t be handled autonomously. The agent needs to escalate.
Ambiguous user intent: If the agent doesn’t understand what the user wants, it should ask for clarification rather than guessing.
Missing capabilities: If the task requires a tool the agent doesn’t have access to, it should explain this limitation to the user rather than trying workarounds.
Authorization blocks: If the user lacks permission for a requested action, the agent should inform them, not attempt unauthorized workarounds.
Confidence threshold failures: If the agent’s confidence in its answer is below a threshold, flag for human review rather than presenting a low-confidence answer as fact.
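The confidence check is the easiest of these to mechanize. A minimal sketch, assuming you have some confidence score available (the threshold is illustrative):

```python
def maybe_escalate(answer, confidence, threshold=0.7):
    """Route low-confidence answers to human review instead of the user."""
    if confidence < threshold:
        return {"status": "needs_review", "answer": answer,
                "confidence": confidence}
    return {"status": "final", "answer": answer}
```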
Logging and Observability
When agents fail, you need rich context to debug.
Log every tool call: What was called, with what parameters, and what was returned (or what error occurred).
Log LLM reasoning: If your agent uses chain-of-thought or similar patterns, log the reasoning steps. This shows you what the agent was thinking when it made a bad decision.
Track decision points: When the agent chooses between multiple tools or strategies, log why it made that choice.
Trace end-to-end flows: Connect all actions in a single agent session so you can see the full sequence of events leading to failure.
This isn’t just for debugging—it’s for improving the system. Patterns in failures reveal design flaws, missing tools, or unclear tool descriptions.
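At minimum, every tool call should produce one structured record tied to a session ID so end-to-end traces can be reconstructed. A JSON-lines sketch (field names illustrative):

```python
import json
import time

def log_tool_call(session_id, tool, params, result=None, error=None):
    """Emit one structured record per tool call for later tracing."""
    record = {
        "ts": time.time(),        # when the call happened
        "session": session_id,    # groups all events in one agent session
        "tool": tool,
        "params": params,
        "result": result,
        "error": error,           # None on success
    }
    print(json.dumps(record, default=str))  # one JSON object per line
    return record
```

The same record shape works for reasoning steps and decision points; just vary the event fields and keep the session ID constant.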
Proactive Error Prevention
The best error handling is preventing errors in the first place.
Constrain the action space: Limit which tools are available at which points in the workflow. If certain tools only make sense after authentication, don’t expose them until authentication succeeds.
Validate preconditions: Before allowing a tool call, check that prerequisites are met. This catches semantic misuse before it causes problems.
Sanity checks on tool outputs: If a tool returns an unexpected format or obviously wrong data, flag it before passing it to the agent. This prevents garbage-in-garbage-out cascades.
Rate limiting per tool: Prevent the agent from spamming any single tool. If it calls the same tool >10 times in 60 seconds, something’s wrong.
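Constraining the action space and validating preconditions can share one mechanism: filter the tool list by the current session state before handing it to the LLM. The state flags and tool specs below are illustrative:

```python
def available_tools(all_tools, state):
    """Expose only tools whose precondition flags hold in the current state.

    `all_tools` maps tool name -> spec; a spec's optional "requires" list
    names state flags that must be truthy before the tool is offered.
    """
    return [
        name for name, spec in all_tools.items()
        if all(state.get(flag) for flag in spec.get("requires", []))
    ]
```

Because `send_email` simply never appears before authentication succeeds, the semantic-misuse sequence from earlier becomes impossible rather than merely discouraged.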
Testing for Error Scenarios
You can’t anticipate all failures, but you can test the common ones.
Inject failures: In testing, make tools fail deliberately. Does the agent retry appropriately? Does it fall back to alternatives? Does it handle errors gracefully or spiral into confusion?
Test edge cases: What happens when tools return empty results? Malformed data? Extremely large responses? Partial failures?
Test cascading failures: Simulate scenarios where Tool A succeeds but returns misleading data, causing Tool B to fail. Can the agent recover?
Adversarial testing: Try to break the agent. Give it contradictory instructions, nonsense inputs, or requests it can’t fulfill. How does it behave?
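Failure injection needs little infrastructure: a test double that fails on command is usually enough to exercise retry, fallback, and circuit-breaker paths. A minimal sketch:

```python
class FlakyTool:
    """Test double that fails the first `fail_times` calls, then succeeds."""

    def __init__(self, fail_times=1):
        self.fail_times = fail_times
        self.calls = 0  # lets tests assert how many attempts were made

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls <= self.fail_times:
            raise RuntimeError("injected failure")
        return "ok"
```

Swap a `FlakyTool` in for a real tool and assert that the agent retried the expected number of times, fell back, or escalated, rather than looping.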
The Mental Model
Traditional error handling: anticipate specific failures, write specific handlers.
Agent error handling: assume failures you didn’t anticipate, provide adaptive recovery mechanisms.
This means:
- Error messages that explain what went wrong and suggest alternatives
- Retry logic that encourages smart adaptation, not blind repetition
- Circuit breakers that prevent cascading failures
- Fallback strategies for when primary approaches fail
- Human escalation for unrecoverable errors
- Rich logging to understand novel failure modes
Agents will fail. Your job isn’t to prevent all failures—it’s to ensure failures are recoverable, observable, and informative. Build systems that degrade gracefully, and you’ll have agents that are resilient in production.