When the agent fails: recovery patterns that don't loop forever

Agent failures don’t throw exceptions. They produce plausible-looking output that’s wrong, or quietly retry the same broken approach in a slightly different way. Wrapping agents in try/catch is the wrong mental model — the agent didn’t crash, it just kept going in a useless direction. Recovery has to be designed in, not bolted on.

The failure modes that need different recovery

Tool failures — the API returned an error or timed out — are the easiest case: the agent should see the error and try a different approach. Reasoning failures — the agent is confidently wrong about what step comes next — are harder, because the agent doesn’t know it’s wrong. Loop failures — the agent retries the same approach over and over — are the worst, because each iteration looks productive in isolation.
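For the easy case, tool failures, the key move is converting the error into an observation the agent can see rather than letting it crash the loop. A minimal sketch (the dict shape and the `safe_tool_call` name are illustrative, not from any specific framework):

```python
def safe_tool_call(tool, args):
    """Run a tool and turn failures into observations the agent can react to.

    Instead of raising, the error text becomes the tool result, so the
    model can read it on the next step and try a different approach.
    """
    try:
        return {"ok": True, "result": tool(**args)}
    except Exception as exc:  # API errors, timeouts, bad arguments
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

Note that this pattern only helps with tool failures; a reasoning failure produces a perfectly valid `{"ok": True}` result that happens to be the wrong thing to have done.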

Recovery patterns that survive contact with reality

Cap iteration count, always. Detect repetition in the action history — if the agent has called the same tool with similar arguments three times in a row, escalate or abort. For reasoning failures, a separate “is the current plan still right?” check, run periodically by a smaller model on the action log, catches the worst cases. None of this is glamorous, and all of it gets cut from the first version of every agent because it feels paranoid until the first time it fires.
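The iteration cap and repetition detector fit in a few lines. A sketch of the loop, assuming a `step_fn` that represents one agent step and returns `(tool_name, args, done)` — a hypothetical interface, and exact-match fingerprinting is a deliberately crude stand-in for "similar arguments":

```python
import hashlib
from collections import deque

MAX_ITERATIONS = 20    # hard cap on loop iterations (assumed value)
REPEAT_THRESHOLD = 3   # abort after this many identical calls in a row

def call_signature(tool_name, args):
    """Reduce a tool call to a comparable fingerprint.

    Exact match on (tool, sorted args) is the crudest similarity test;
    a real system might normalize or fuzzy-match arguments instead.
    """
    canonical = tool_name + "|" + repr(sorted(args.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()

def run_agent(step_fn):
    """Drive an agent loop with an iteration cap and repetition detection."""
    recent = deque(maxlen=REPEAT_THRESHOLD)
    for _ in range(MAX_ITERATIONS):
        tool_name, args, done = step_fn()
        if done:
            return "completed"
        recent.append(call_signature(tool_name, args))
        # All of the last N fingerprints identical -> the agent is looping.
        if len(recent) == REPEAT_THRESHOLD and len(set(recent)) == 1:
            return f"aborted: repeated the same call {REPEAT_THRESHOLD} times"
    return "aborted: iteration cap reached"
```

The periodic "is the plan still right?" check would slot into the same loop — every few iterations, hand the action log to a cheaper model and abort on a no — but that piece needs a model call and is omitted here.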

Agent failure recovery is the part of the system that exists to keep small failures from becoming catastrophic. Skipping it is how you discover that “agentic” and “autonomous” are not the same word.
