Tool selection: when the model should pick, and when you should

Tool-using agents look powerful in demos because the model is choosing what to do next. They look fragile in production because the model is choosing what to do next. The space of available tools grows linearly with features and quadratically with edge cases; past about a dozen tools, the model starts conflating their roles and picking based on surface similarity in tool names.

What goes wrong as tool count grows

Beyond ten or fifteen tools, the descriptions blur together in the model’s representation. The model picks a search tool when a database lookup was correct, because both have “lookup” in their descriptions. It picks the simpler tool when the complex one was needed, because the simpler one matched the user’s phrasing. None of this shows up in single-call testing; it surfaces when one tool quietly handles a request another tool was supposed to handle, and the answer is technically valid but operationally wrong.

Architectural answers, not prompt answers

Group tools by purpose and route each request to a sub-agent that only sees the relevant subset. Surface fewer tools to the top-level model than you actually expose internally: five visible tools with clear purposes outperform twenty undifferentiated ones. For destructive or expensive tools, require an explicit name match, not a model-chosen one, as in the sketch below.
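A minimal sketch of the routing pattern, assuming a hand-rolled registry rather than any particular framework; the group names, tool names, and handlers here are hypothetical stand-ins:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    handler: Callable[..., str]
    destructive: bool = False

# Tools grouped by purpose. The top-level model never sees this flat
# structure; it only sees one coarse entry per group via visible_tools().
TOOL_GROUPS: dict[str, list[Tool]] = {
    "search": [
        Tool("web_search", "Full-text search over public pages", lambda q: f"web hits for {q}"),
        Tool("doc_search", "Semantic search over internal docs", lambda q: f"doc hits for {q}"),
    ],
    "database": [
        Tool("customer_lookup", "Fetch a customer record by ID", lambda i: f"customer {i}"),
        Tool("order_lookup", "Fetch an order record by ID", lambda i: f"order {i}"),
    ],
    "admin": [
        Tool("delete_record", "Permanently delete a record", lambda i: f"deleted {i}", True),
    ],
}

def visible_tools() -> list[dict]:
    """The top-level model's view: one entry per group, not per tool."""
    return [
        {"name": group, "description": "; ".join(t.description for t in tools)}
        for group, tools in TOOL_GROUPS.items()
    ]

def route(group: str) -> list[Tool]:
    """Hand the sub-agent only the tools in the chosen group."""
    return TOOL_GROUPS[group]

def call_tool(tools: list[Tool], name: str, *args, confirmed_name: str | None = None) -> str:
    """Run a tool from the sub-agent's subset. Destructive tools require
    the caller to repeat the exact tool name, so a fuzzy model choice
    can never reach them by accident."""
    tool = next((t for t in tools if t.name == name), None)
    if tool is None:
        raise KeyError(f"unknown tool: {name}")
    if tool.destructive and confirmed_name != tool.name:
        raise PermissionError(f"{tool.name} requires an explicit name match")
    return tool.handler(*args)
```

The point of the shape: the top-level model chooses among three coarse entries, each sub-agent chooses among two or three tools, and delete_record is unreachable unless the caller spells out its exact name, so no single decision is ever made against the full list.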

The number of tools an agent should choose from is much smaller than the number of tools you’d like to give it. Past a threshold, every additional tool makes every other choice worse.
