Agent guardrails without lobotomizing the agent

Adding guardrails to an agent is one of those tasks where the easy version comes out too restrictive or too permissive, and rarely anywhere in between. Block too aggressively and the agent refuses tasks that were perfectly fine; block too leniently and the headlines write themselves. The first attempt usually ends up as either the chatbot that won’t do anything or the agent that should not have done that.

Why string-based blocking is the wrong layer

Filtering agent actions on keywords or tool-name regex catches yesterday’s attacks but not tomorrow’s. The agent will paraphrase its way around any blocklist long enough to be useful. The signal you actually want is intent, not surface text, and intent isn’t a string-match problem.
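
To make the paraphrase problem concrete, here is a minimal sketch of a keyword filter and a command that slips past it. The blocklist patterns, the function name, and both example commands are hypothetical, chosen only for illustration; they are not taken from any real product or incident.

```python
import re

# Hypothetical blocklist: two destructive patterns someone might think to block.
BLOCKLIST = re.compile(r"\brm\s+-rf\b|\bdrop\s+table\b", re.IGNORECASE)


def keyword_filter_allows(command: str) -> bool:
    """Return True when no blocklisted pattern appears in the command text."""
    return BLOCKLIST.search(command) is None


# The command the blocklist author had in mind.
obvious = "rm -rf /var/data"

# A different spelling of roughly the same intent: empty the directory.
paraphrased = "find /var/data -mindepth 1 -delete"

print(keyword_filter_allows(obvious))      # False: caught by the pattern
print(keyword_filter_allows(paraphrased))  # True: same intent, different surface text
```

Adding a pattern for `find ... -delete` only moves the problem to the next phrasing, which is why the check belongs at a layer that does not depend on how the action is worded.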

Layered defenses that work

Constrain at the tool layer: dangerous tools should require explicit user confirmation, regardless of what the agent says. Constrain at the data layer: the agent shouldn’t have access to credentials or PII it doesn’t need, even if the user asks for it. Constrain at the policy layer: a separate model running policy checks on the agent’s plans, before execution, catches the cases where intent is genuinely off. Constrain at the audit layer: log every action with enough context that humans can review the borderline cases. Each layer fails sometimes; together they fail rarely.
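
One way the four layers could fit together is sketched below. Everything here is an assumption made for illustration: the `Tool` and `Guard` names, the outcome strings, and especially the `policy_check` callback, which is a stub standing in for a call to a separate policy model rather than a real implementation of one.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Tool:
    name: str
    run: Callable[[dict], str]
    dangerous: bool = False  # tool layer: dangerous tools need user confirmation


@dataclass
class Guard:
    tools: dict[str, Tool]
    confirm: Callable[[str, dict], bool]       # asks the human, never the agent
    policy_check: Callable[[str, dict], bool]  # stub for a separate policy model
    allowed_fields: frozenset[str]             # data layer: fields the task needs
    audit_log: list[str] = field(default_factory=list)

    def scope(self, record: dict) -> dict:
        """Data layer: drop every field the agent has no need to see."""
        return {k: v for k, v in record.items() if k in self.allowed_fields}

    def execute(self, tool_name: str, args: dict, rationale: str) -> str:
        tool = self.tools[tool_name]
        if not self.policy_check(tool_name, args):               # policy layer
            outcome = "blocked_by_policy"
        elif tool.dangerous and not self.confirm(tool_name, args):  # tool layer
            outcome = "declined_by_user"
        else:
            outcome = "executed"
        # Audit layer: record every decision, including the refusals.
        self.audit_log.append(json.dumps({
            "ts": time.time(), "tool": tool_name, "args": args,
            "rationale": rationale, "outcome": outcome,
        }))
        return tool.run(args) if outcome == "executed" else outcome


guard = Guard(
    tools={
        "read_file": Tool("read_file", lambda a: f"contents of {a['path']}"),
        "delete_file": Tool("delete_file", lambda a: f"deleted {a['path']}", dangerous=True),
    },
    confirm=lambda name, args: False,  # simulated user who declines
    policy_check=lambda name, args: not args.get("path", "").startswith("/etc"),
    allowed_fields=frozenset({"name", "plan"}),
)

print(guard.execute("read_file", {"path": "/tmp/a.txt"}, "user asked for the file"))
# contents of /tmp/a.txt
print(guard.execute("delete_file", {"path": "/tmp/a.txt"}, "cleanup"))
# declined_by_user
print(guard.execute("read_file", {"path": "/etc/shadow"}, "debugging"))
# blocked_by_policy
print(guard.scope({"name": "Ada", "plan": "pro", "ssn": "000-00-0000"}))
# {'name': 'Ada', 'plan': 'pro'}
```

The design choice worth noting is that none of the checks live in the agent’s prompt. The agent only proposes a tool call; the surrounding code decides whether it runs, what data it sees, and what gets logged.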

Guardrails are not a feature you add to your agent. They’re a property of the architecture the agent runs in.
