Evaluating agents when there's no single right answer

Evaluating a single prompt is hard. Evaluating an agent that runs ten tool calls before answering is a different category of hard. Two trajectories that reach the same correct answer rarely match each other step for step. Trajectories that produce a wrong answer often look reasonable until step seven. Standard exact-match scoring is useless here, and reviewers burn out fast on long-form trace inspection.

What ends up actually working

Three signals do the heavy lifting. Outcome correctness — did the final answer match the ground truth — is necessary but not sufficient. Trajectory cost — number of steps, total tokens, total tool calls — catches the agents that get the right answer the wrong way. And subgoal progress — did the agent advance through expected milestones — catches the silent-failure cases where the agent reaches the answer by accident.
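
To make this concrete, here is a minimal sketch of computing all three signals over one recorded trajectory. Everything in it is an assumption to adapt to your own setup: the trace format (step dicts with type, content, and tokens keys), the substring check for milestones, and every field name come from a hypothetical harness, not a real library.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryScore:
    """One agent run, scored on the three signals. Names are illustrative."""
    outcome_correct: bool   # did the final answer match ground truth
    steps: int              # trajectory cost: tool calls taken
    total_tokens: int       # trajectory cost: tokens consumed
    subgoals_hit: int       # expected milestones the agent reached
    subgoals_expected: int  # milestones the rubric defines

    @property
    def subgoal_progress(self) -> float:
        return self.subgoals_hit / max(self.subgoals_expected, 1)

def score_trajectory(trace: list[dict], ground_truth: str,
                     expected_subgoals: list[str]) -> TrajectoryScore:
    """Score a recorded trajectory. `trace` is assumed to be a list of
    step dicts with 'type', 'content', and 'tokens' keys; the substring
    match for subgoals is a crude stand-in for a real milestone check."""
    final = next((s["content"] for s in reversed(trace)
                  if s["type"] == "answer"), "")
    hit = sum(1 for goal in expected_subgoals
              if any(goal in s["content"] for s in trace))
    return TrajectoryScore(
        outcome_correct=final.strip() == ground_truth.strip(),
        steps=sum(1 for s in trace if s["type"] == "tool_call"),
        total_tokens=sum(s.get("tokens", 0) for s in trace),
        subgoals_hit=hit,
        subgoals_expected=len(expected_subgoals),
    )
```

Keeping the three signals as separate fields, rather than collapsing them into one number up front, means a regression in cost can't hide behind a gain in correctness.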

Building the eval set

Hand-curate twenty trajectories before you spend a dollar on automation. The first twenty teach you what signals matter for your task. After that, LLM-as-judge with a careful rubric scales further than human review, but only if you’ve calibrated the judge against your hand-labeled set. Skip that calibration and the judge will agree with itself confidently and wrongly.
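
Here is a minimal sketch of that calibration step, assuming categorical labels like "pass"/"fail" on both sides. Raw agreement alone misleads when one label dominates, so the sketch also computes Cohen's kappa, which corrects for chance agreement; the 0.7 threshold in the usage comment is a judgment call, not a standard.

```python
from collections import Counter

def judge_calibration(human: list[str], judge: list[str]) -> dict:
    """Compare LLM-judge labels against hand labels on the same
    trajectories. Labels are assumed to be categorical strings."""
    assert len(human) == len(judge) and human
    n = len(human)
    agree = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their observed rates.
    h_counts, j_counts = Counter(human), Counter(judge)
    chance = sum((h_counts[l] / n) * (j_counts[l] / n)
                 for l in set(human) | set(judge))
    kappa = (agree - chance) / (1 - chance) if chance < 1 else 1.0
    return {"agreement": agree, "kappa": kappa}

# Usage: run the judge over the 20 hand-labeled trajectories first.
# stats = judge_calibration(hand_labels, judge_labels)
# if stats["kappa"] < 0.7:  # threshold is a judgment call
#     tighten the rubric and re-run before scaling the judge
```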

Agent eval looks like a metrics problem and is actually a labeling problem. The teams that ship reliable agents have invested in trajectory datasets that the rest of the field would consider too tedious to build.
