Agent Loops and Knowing When to Stop
— Agents iterate until they solve the problem—or until they loop forever, burning tokens and accomplishing nothing. Here's how to design termination conditions that work.
An agent is supposed to keep working until it solves the problem. But what if it never solves the problem? What if it gets stuck in a loop, calling the same tool repeatedly with slightly different parameters, never making progress?
You’ve seen this in production: an agent that burns through your API quota searching for information it will never find. An agent that keeps retrying a failed operation, convinced the next attempt will work. An agent that spirals into increasingly irrelevant actions because it lost track of the original goal.
Infinite loops don't just waste money: they degrade the user experience, put load on downstream systems, and turn debugging into a nightmare. Getting termination right is essential.
The Naive Approach: Max Iterations
The simplest solution: set a hard limit. “The agent can iterate at most 10 times, then it stops.”
This prevents infinite loops, but it’s brittle:
- If the limit is too low, the agent fails on legitimate complex tasks
- If the limit is too high, it wastes resources on hopeless tasks
- The right limit varies by task—simple queries need 2-3 iterations, complex research might need 20
Max iterations is a safety net, not a strategy. You should have one, but don’t rely on it as your primary termination condition.
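As a safety net, though, it's trivial to implement. A minimal sketch, assuming a hypothetical `step` callable that runs one reasoning/tool-call iteration and returns a final answer once the agent believes it's done:

```python
from typing import Callable, Optional

MAX_ITERATIONS = 10  # safety net, not the primary termination condition

def run_agent(step: Callable[[list], Optional[str]]) -> str:
    # `step` is a hypothetical callable: one reasoning/tool-call
    # iteration that returns a final answer, or None if not done yet.
    history: list = []
    for _ in range(MAX_ITERATIONS):
        answer = step(history)
        if answer is not None:
            return answer
    # Hard stop: surface the cap instead of looping forever.
    raise RuntimeError(f"hit the {MAX_ITERATIONS}-iteration cap without an answer")
```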
Goal Achievement Detection
The best termination condition is success: the agent accomplished what it set out to do.
But how do you detect that programmatically?
Explicit completion signals: Many agent frameworks have the LLM return a “done” signal when it believes the task is complete. The LLM calls a special “finish” tool with the final answer, and the loop terminates.
This works well when the task has a clear deliverable: “research this topic and provide a summary.” The agent knows it’s done when it has a complete summary.
It fails when tasks are ambiguous or open-ended. “Help the user” doesn’t have a clear completion point. The agent might decide it’s done prematurely, or keep searching for ever-better answers.
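Mechanically, the signal is usually just another tool the model can call. A sketch of that shape; the `finish` schema and the `call_model`/`execute_tool` adapters are illustrative, not any specific framework's API:

```python
# Illustrative "finish" tool; the JSON-schema shape mirrors common
# tool-calling conventions, but nothing here is a specific vendor API.
FINISH_TOOL = {
    "name": "finish",
    "description": "Call this when the task is complete.",
    "parameters": {
        "type": "object",
        "properties": {"answer": {"type": "string"}},
        "required": ["answer"],
    },
}

def run_loop(call_model, execute_tool, max_steps=20):
    # call_model / execute_tool are hypothetical adapters around your
    # LLM client and tool registry; call_model returns (tool_name, args).
    messages = []
    for _ in range(max_steps):
        tool_name, args = call_model(messages, tools=[FINISH_TOOL])
        if tool_name == "finish":  # explicit completion signal
            return args["answer"]
        messages.append(execute_tool(tool_name, args))
    return None  # fell through to the max-step safety net
```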
Validation checks: For some tasks, you can programmatically verify success. If the agent’s goal is to fix failing tests, run the tests after each iteration. When they pass, terminate. If the agent’s goal is to find a record in a database, check whether the record was found.
This requires that you can define success criteria in code. That’s possible for well-structured tasks (write code that compiles, generate valid JSON, retrieve specific data) but not for subjective tasks (write engaging content, summarize this document concisely).
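For the failing-tests example, the check can be as simple as shelling out to the test runner between iterations. A sketch assuming pytest and a hypothetical `agent_step` callable that performs one round of edits:

```python
import subprocess

def tests_pass() -> bool:
    # The success criterion is defined in code: the task is done
    # exactly when the test suite passes.
    result = subprocess.run(["pytest", "-q"], capture_output=True)
    return result.returncode == 0

def fix_until_green(agent_step, max_steps: int = 15) -> bool:
    # agent_step is a hypothetical callable: one round of agent edits.
    for _ in range(max_steps):
        if tests_pass():
            return True   # validated success -> terminate
        agent_step()
    return False          # hit the safety net without passing
```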
Detecting Stagnation
Sometimes the agent neither succeeds nor explicitly fails—it just stops making progress.
Repeated actions: If the agent calls the same tool with the same parameters multiple times, it’s stuck. Detect this by tracking the last N actions and checking for duplicates.
Oscillation: The agent tries action A, then action B, then action A again, cycling endlessly. This is harder to detect than exact repetition, but you can flag when recent actions show a repeating pattern.
No new information: If the agent keeps acting but observations aren’t changing, it’s not making progress. Track whether each iteration produces novel information or just rehashes what was already known.
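A sketch of the first two checks (exact repetition and oscillation) over a sliding window of recent actions; the window size and repeat threshold are illustrative:

```python
from collections import deque

class StagnationDetector:
    """Flags exact repeats and short A-B-A-B cycles in recent actions.
    Window size and repeat threshold are illustrative."""

    def __init__(self, window: int = 6):
        self.recent: deque = deque(maxlen=window)

    def is_stuck(self, tool: str, params: dict) -> bool:
        action = (tool, tuple(sorted(params.items())))
        # Same tool + same parameters already seen twice in the window.
        repeated = self.recent.count(action) >= 2
        self.recent.append(action)
        # A-B-A-B oscillation over the last four actions.
        oscillating = (
            len(self.recent) >= 4
            and self.recent[-1] == self.recent[-3]
            and self.recent[-2] == self.recent[-4]
        )
        return repeated or oscillating
```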
When stagnation is detected, you can:
- Force termination and return a partial result
- Switch strategies (if the agent supports meta-reasoning about its approach)
- Escalate to human review
Failure Detection
Some tasks are impossible, and the agent should recognize this and stop.
Explicit failure signals: The LLM can call a “fail” tool to indicate it’s stuck. “I’ve tried multiple approaches and can’t solve this problem.” This requires that the agent is willing to admit failure, which isn’t always reliable—LLMs are often overconfident.
Resource exhaustion: If the agent runs out of available tools or data sources, it can’t make further progress. Detect when the agent has tried all available options without success.
Error accumulation: If the agent encounters repeated tool failures (rate limits, permission errors, network timeouts), continuing is unlikely to help. After N consecutive failures, terminate and surface the error.
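A minimal tracker for that last case; the threshold of three consecutive failures is illustrative:

```python
class FailureTracker:
    """Stops the loop after N consecutive tool errors; a single success
    resets the streak. N=3 is an illustrative threshold."""

    def __init__(self, max_consecutive: int = 3):
        self.max_consecutive = max_consecutive
        self.streak = 0

    def record(self, error: str | None) -> None:
        if error is None:
            self.streak = 0  # success resets the count
            return
        self.streak += 1
        if self.streak >= self.max_consecutive:
            raise RuntimeError(
                f"{self.streak} consecutive tool failures; last: {error}"
            )
```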
Cost and Latency Budgets
Even if the agent could eventually succeed, you might not want to wait.
Token limits: Each iteration costs tokens. Set a budget (e.g., 50,000 tokens) and terminate when it’s exceeded, regardless of completion status.
Time limits: Users won’t wait 5 minutes for a response. Set a wall-clock timeout and return the best partial result if the timeout is reached.
Iteration budgets by task type: Simple queries get 5 iterations, complex tasks get 20, research tasks get 50. Tailor limits to expected complexity.
These are guardrails to prevent runaway costs, not ideal termination conditions. Ideally, the agent succeeds or fails gracefully before hitting budget limits.
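A sketch combining the first two guardrails, with illustrative limits; `charge` would be called with each model response's token usage, and the loop would check `exceeded` every iteration:

```python
import time

class Budget:
    """Guardrails, not goals: stop when tokens or wall-clock time run
    out, whichever comes first. Limits are illustrative."""

    def __init__(self, max_tokens: int = 50_000, max_seconds: float = 60.0):
        self.max_tokens = max_tokens
        self.deadline = time.monotonic() + max_seconds
        self.tokens_used = 0

    def charge(self, tokens: int) -> None:
        self.tokens_used += tokens  # add each response's token usage

    def exceeded(self) -> str | None:
        if self.tokens_used >= self.max_tokens:
            return "token_budget"
        if time.monotonic() >= self.deadline:
            return "timeout"
        return None  # still within budget
```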
Progressive Prompting
One way to encourage timely termination: remind the agent of iteration limits as it approaches them.
“You have 10 iterations remaining. Focus on the most promising approach.”
“You have 3 iterations remaining. If you haven’t solved the problem by then, provide the best partial answer you can.”
This gives the agent a sense of urgency and encourages it to prioritize high-value actions over exhaustive search.
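A sketch of how those reminders might be generated each iteration; the thresholds and wording are illustrative:

```python
def urgency_reminder(remaining: int) -> str | None:
    # Returned string is appended to the prompt each iteration;
    # thresholds and wording are illustrative.
    if remaining <= 3:
        return (f"You have {remaining} iterations remaining. If you haven't "
                "solved the problem by then, provide the best partial answer you can.")
    if remaining <= 10:
        return (f"You have {remaining} iterations remaining. "
                "Focus on the most promising approach.")
    return None  # plenty of budget left; no reminder needed
```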
Partial Results and Graceful Degradation
The agent doesn’t have to fully succeed or completely fail. Often, the best outcome is a useful partial result.
Confidence scoring: If the agent can’t definitively answer a question, it can return its best guess with a confidence score. “I’m 70% confident the answer is X based on these sources.”
Provenance tracking: Return not just the answer, but the reasoning and sources that led to it. Even if the answer is incomplete, users can evaluate its trustworthiness.
Fallback strategies: If the primary approach fails, try a simpler backup. If deep research doesn’t work, return a basic search result. Something is better than nothing.
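A sketch of that fallback pattern, with `deep_research` and `basic_search` as hypothetical callables standing in for your primary and backup paths:

```python
def answer_with_fallback(question: str, deep_research, basic_search) -> dict:
    # deep_research / basic_search are hypothetical callables standing
    # in for your primary and backup retrieval paths.
    try:
        return {"answer": deep_research(question), "quality": "full"}
    except Exception:
        pass  # primary approach failed or timed out
    result = basic_search(question)
    if result:
        return {"answer": result, "quality": "degraded"}
    return {"answer": None, "quality": "none"}  # nothing to offer
```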
Multi-Agent Termination
When multiple agents collaborate, termination is more complex.
All agents must agree: A research agent and a writing agent might both need to confirm they’ve completed their parts before the overall task is done.
Sequential dependencies: Agent A must finish before Agent B can start. If Agent A fails or times out, Agent B never runs.
Voting or consensus: If multiple agents attempt the same task, they terminate when N out of M agents agree on a result.
The orchestration layer needs clear rules about when the overall system is done, not just when individual agents finish.
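For the voting case, the termination check itself is small. A sketch assuming exact-match answers, which real systems usually have to relax with normalization or fuzzy comparison:

```python
from collections import Counter

def consensus(answers: list[str], n_required: int) -> str | None:
    # Terminate the ensemble once n_required of the M agents agree.
    # Exact-match agreement is a simplification.
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes >= n_required else None

# e.g. consensus(["42", "42", "41"], n_required=2) -> "42"
```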
What to Return When Stopping
When the loop ends, return something useful:
- On success: The final answer, plus metadata (number of iterations, tools called, confidence score)
- On failure: An explanation of what went wrong, what was attempted, and why it didn't work
- On timeout: The best partial result, with a clear indication that it's incomplete
Users need to know whether the agent succeeded, failed, or ran out of time. Don’t return a confident-sounding but wrong answer just because the agent ran out of iterations.
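One way to enforce that is a single result envelope for every exit path, so callers can always distinguish the three outcomes; the field names here are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class AgentResult:
    """One envelope for every exit path, so callers can always tell
    success from failure from timeout. Field names are illustrative."""
    status: str                  # "success" | "failure" | "timeout" | "stagnation"
    answer: str | None = None    # final answer, or best partial result
    explanation: str = ""        # what was attempted and why it stopped
    iterations: int = 0
    tools_called: list = field(default_factory=list)
    confidence: float | None = None
```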
Logging for Post-Mortem Analysis
When an agent stops, log enough information to understand why:
- Number of iterations executed
- Sequence of actions and observations
- Termination reason (success, failure, timeout, stagnation, etc.)
- Final state and any partial results
This lets you analyze patterns: do agents usually timeout on certain query types? Is stagnation detection too aggressive? Are max iteration limits too conservative?
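A sketch of a structured termination log, reusing the illustrative `AgentResult` envelope from the previous section; one JSON line per run makes these patterns easy to aggregate:

```python
import json
import logging

logger = logging.getLogger("agent")

def log_termination(result, actions: list) -> None:
    # `result` is the illustrative AgentResult from above;
    # `actions` is the run's sequence of (tool, params) pairs.
    logger.info(json.dumps({
        "termination_reason": result.status,
        "iterations": result.iterations,
        "actions": actions,
        "had_partial_result": result.answer is not None,
    }))
```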
The Right Mental Model
Termination isn’t just a technical detail—it’s a fundamental design question. What does success look like? When is partial success acceptable? How much cost and latency are you willing to tolerate?
Agents that iterate indefinitely are unusable. Agents that give up too quickly are unhelpful. The right balance depends on your use case, and getting it right requires careful thought about goals, constraints, and failure modes.
Build explicit termination conditions, monitor them in production, and tune based on real usage. Termination logic is as important as the agent’s reasoning—it’s what makes the difference between a useful tool and a runaway process that burns resources and delivers nothing.