Agent guardrails without lobotomizing the agent

Adding guardrails to an agent is one of those tasks where the easy version is too restrictive and the careful version is too permissive. Block too aggressively and the agent refuses tasks that were perfectly fine; block too loosely and the headlines write themselves. The first attempt is usually either the chatbot that won’t do anything or the agent that should not have done that.

Why string-based blocking is the wrong layer

Filtering agent actions on keywords or tool-name regexes catches yesterday’s attacks and not tomorrow’s. The agent will paraphrase its way around any blocklist long enough to be useful. The signal you actually want is intent, not surface text, and intent isn’t a string-match problem.
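To make the paraphrase problem concrete, here is a minimal sketch of a keyword filter and an action that slips past it. The patterns, the `run_shell:` action format, and the function name are all hypothetical, chosen only for illustration.

```python
import re

# Hypothetical blocklist of the kind argued against above.
BLOCKLIST = [r"rm\s+-rf", r"drop\s+table", r"delete_all_users"]


def blocked_by_keywords(action_text: str) -> bool:
    """Return True if any blocklist pattern matches the action text."""
    return any(re.search(p, action_text, re.IGNORECASE) for p in BLOCKLIST)


# The literal form is caught...
literal = "run_shell: rm -rf /var/data"
# ...but a command with the same effect, phrased differently, is not.
paraphrase = "run_shell: find /var/data -mindepth 1 -delete"

print(blocked_by_keywords(literal))     # True
print(blocked_by_keywords(paraphrase))  # False
```

Adding a pattern for `find ... -delete` closes this one hole and leaves the next one open, which is the sense in which the filter only ever catches yesterday’s attacks.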

Layered defenses that work

Constrain at the tool layer: dangerous tools should require explicit user confirmation, regardless of what the agent says. Constrain at the data layer: the agent shouldn’t have access to credentials or PII it doesn’t need, even if the user asks for it. Constrain at the policy layer: a separate model running policy checks on the agent’s plans, before execution, catches the cases where intent is genuinely off. Constrain at the audit layer: log every action with enough context that humans can review the borderline cases. Each layer fails sometimes; together they fail rarely.
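One way the four layers might fit together is sketched below. This is an illustration under stated assumptions, not a reference design: `Action`, `Guardrails`, the `secret` argument, and the verdict strings are all hypothetical names, and the policy check is a toy lambda standing in for a call to a separate model.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    tool: str
    args: dict


@dataclass
class Guardrails:
    confirm_tools: set                      # tool layer: tools that always need a human "yes"
    allowed_secrets: set                    # data layer: the only secrets this agent may touch
    policy_check: Callable[[Action], bool]  # policy layer: stand-in for a separate model
    confirm: Callable[[Action], bool]       # tool layer: how the user is asked (a UI prompt in practice)
    audit_log: list = field(default_factory=list)

    def run(self, action: Action, execute: Callable[[Action], object]):
        verdict = self._decide(action)
        # Audit layer: record every decision, allowed or not, for later human review.
        self.audit_log.append(json.dumps({
            "ts": time.time(),
            "tool": action.tool,
            "args": action.args,
            "verdict": verdict,
        }))
        return execute(action) if verdict == "allowed" else None

    def _decide(self, action: Action) -> str:
        secret = action.args.get("secret")
        if secret is not None and secret not in self.allowed_secrets:
            return "denied:data"
        if not self.policy_check(action):
            return "denied:policy"
        # Confirmation is required no matter how the agent justified the call.
        if action.tool in self.confirm_tools and not self.confirm(action):
            return "denied:unconfirmed"
        return "allowed"


guard = Guardrails(
    confirm_tools={"send_email"},
    allowed_secrets={"calendar_token"},
    policy_check=lambda a: a.tool != "delete_account",  # toy rule, assumption only
    confirm=lambda a: False,                            # simulate the user declining
)


def execute(action: Action) -> str:
    return f"ran {action.tool}"


r1 = guard.run(Action("read_calendar", {"secret": "calendar_token"}), execute)
r2 = guard.run(Action("read_inbox", {"secret": "mail_password"}), execute)
r3 = guard.run(Action("send_email", {"to": "a@example.com"}), execute)
print(r1, r2, r3)
```

Note that none of the checks live in the agent’s prompt. Each one sits in the wrapper that executes the agent’s actions, so the agent cannot argue its way past them, and a denied action still leaves an audit record.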

Guardrails are not a feature you add to your agent. They’re a property of the architecture the agent runs in.
