The Agent Harness: Why Your Model Isn't the Problem
- William Jacob
- Agents, Architecture
- 11 May, 2026
LangChain jumped from outside the top 30 to number 5 on TerminalBench 2.0. They didn’t change the model. Same LLM. Same parameters. The only thing that changed was the software wrapped around it — the harness.
Another team went further: they let an LLM optimize the harness itself and hit a 76.4% pass rate, beating systems designed by human engineers. The pattern is consistent. The model is rarely your bottleneck. The infrastructure around it almost always is.
Anthropic, OpenAI, Perplexity, and LangChain are all building the same thing — each under a different name, each with different trade-offs. In early 2026, the industry settled on a term for it: the AI Agent Harness.
What a harness actually is
A harness is everything outside the model that makes an agent work. The orchestration loop that runs until the task is done. The tool definitions that give the model hands. The memory system that remembers across sessions. The context management that decides what the model sees and when. The error recovery that catches failures before they cascade. The guardrails that say no when the model wants to do something dangerous.
Vivek Trivedy from LangChain put it bluntly: “If you are not the model, you are the Harness.”
Beren Millidge made the analogy sharper. A raw LLM is like a CPU with no RAM, no disk, and no I/O. The context window is RAM — fast but small. External databases are disk — large but slow. Tools are device drivers. The harness is the operating system. We reinvented Von Neumann architecture, not because it’s clever, but because it’s the most natural abstraction for any computing system.
Anthropic describes their runtime as a “dumb loop.” All the intelligence lives in the model. The harness just manages turn-taking, tool execution, and state. The loop is simple. The complexity is in everything the loop has to handle.
The three layers of agent engineering
Engineering around a model happens in three concentric layers:
Prompt engineering — crafting what the model receives as instructions. This is the innermost layer and the one most people start with.
Context engineering — managing what the model sees at what point in time. What’s in the window right now? What got summarized? What got dropped? This is where “lost in the middle” kills performance.
Harness engineering — all of the above, plus the full application architecture: tool orchestration, state persistence, error recovery, verification loops, security, and lifecycle management. This is the layer that makes or breaks production agents.
Most teams over-invest in prompt engineering and under-invest in harness engineering. The TerminalBench results suggest they have the ratio backwards.
The 12 components, grouped by what breaks first
A production-grade harness has twelve distinct components. You don’t need all twelve on day one. But you need to know which ones will fail first when you push to production.
The heartbeat: orchestration loop
This is the while-loop that runs think → act → observe until the task completes. It assembles the prompt, calls the model, parses the output, executes tool calls, feeds results back, and repeats. Simple to describe, hard to implement well — because the loop has to handle every state the model can generate, including the ones you didn’t anticipate.
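A minimal sketch of that loop in Python. `call_model` and `tools` are hypothetical stand-ins (a client that returns an assistant-message dict, and a name-to-callable map), not any particular SDK:

```python
# Minimal think -> act -> observe loop. `call_model` and `tools` are hypothetical
# stand-ins: a model client returning an assistant message dict, and a name -> callable map.
def run_agent(task: str, call_model, tools: dict, max_turns: int = 20) -> str:
    messages = [
        {"role": "system", "content": "You are an agent. Use tools to complete the task."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_turns):
        reply = call_model(messages)                      # think
        messages.append(reply)

        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:                                 # no tool call means final answer
            return reply["content"]

        for call in tool_calls:                            # act
            try:
                result = tools[call["name"]](**call["arguments"])
            except Exception as exc:                       # surface the failure, don't crash the loop
                result = f"ERROR: {exc}"
            messages.append({"role": "tool",               # observe
                             "tool_call_id": call.get("id"),
                             "content": str(result)})
    raise RuntimeError("agent hit max_turns without producing a final answer")
```

Everything that follows in this list is, in some sense, a refinement of what this loop feeds the model and what it does with the output.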
The hands: tools
Tools are structured schemas injected into the model’s context so it knows what it can do. The harness handles registration, parameter extraction, sandboxed execution, and result formatting. Claude Code ships six tool categories: files, search, execution, web access, code analysis, and subagent spawning. OpenAI’s SDK supports function tools, managed tools (web search, code interpreter), and MCP server tools.
The key insight: tools should be scoped to the current step, not the entire agent. Past about a dozen tools visible at once, the model starts picking the wrong ones.
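As an illustration, a tool can be registered as a JSON-schema description plus a callable, and the harness can filter which schemas the model sees per step. The registry shape and phase tags below are assumptions for the sketch, not a specific SDK:

```python
# Hypothetical tool registry: each tool carries a JSON-schema description,
# the callable that implements it, and tags for which phase of the task it belongs to.
TOOLS = {
    "read_file": {
        "schema": {
            "name": "read_file",
            "description": "Read a UTF-8 text file from the workspace.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
        "phases": {"explore", "edit"},
        "fn": lambda path: open(path, encoding="utf-8").read(),
    },
    "run_tests": {
        "schema": {
            "name": "run_tests",
            "description": "Run the project's test suite and return the output.",
            "parameters": {"type": "object", "properties": {}},
        },
        "phases": {"verify"},
        "fn": lambda: "stub: plug in your test runner here",
    },
}

def tools_for_phase(phase: str) -> list[dict]:
    """Expose only the schemas relevant to the current step, not the full set."""
    return [t["schema"] for t in TOOLS.values() if phase in t["phases"]]
```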
The memory: short, long, and index
Memory operates on different timescales. Short-term memory is the conversation history within a session. Long-term memory persists across sessions — files on disk, JSON stores organized by namespace, SQLite-backed sessions.
Claude Code implements a three-tier architecture: a lightweight index of roughly 150-character entries that’s always loaded, detailed topic files loaded on demand, and raw conversation transcripts accessible only through search. The design principle: agents treat their own memories as hints, not truth. Every recalled fact gets verified against current state before being acted on.
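A toy sketch of that shape: a short, always-loaded index plus topic files read only on demand. The file layout is illustrative; only the tiering and the ~150-character cap come from the description above:

```python
from pathlib import Path

MEMORY_DIR = Path("memory")          # illustrative layout: memory/index.md + memory/topics/*.md
INDEX_ENTRY_MAX = 150                # keep index entries short so the index stays cheap to load

def add_memory(topic: str, summary: str, details: str) -> None:
    """Append a one-line hint to the index and write the full detail to a topic file."""
    (MEMORY_DIR / "topics").mkdir(parents=True, exist_ok=True)
    with open(MEMORY_DIR / "index.md", "a", encoding="utf-8") as f:
        f.write(f"- [{topic}] {summary[:INDEX_ENTRY_MAX]}\n")
    (MEMORY_DIR / "topics" / f"{topic}.md").write_text(details, encoding="utf-8")

def load_index() -> str:
    """Always-loaded tier: the whole index, a few KB at most. Treat entries as hints."""
    path = MEMORY_DIR / "index.md"
    return path.read_text(encoding="utf-8") if path.exists() else ""

def load_topic(topic: str) -> str:
    """On-demand tier: full detail, loaded only when the agent asks for it."""
    return (MEMORY_DIR / "topics" / f"{topic}.md").read_text(encoding="utf-8")
```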
The context window: what the model actually sees
Context management is where most agents quietly fail. Critical information in the middle of the window degrades model performance by over 30% — Stanford’s “lost in the middle” finding. Even million-token windows show degrading instruction-following as context grows.
Production strategies include:
- Compaction: summarize conversation history while preserving architectural decisions and unresolved bugs.
- Observation masking: hide old tool outputs but keep the call records.
- Just-in-time retrieval: load data on demand instead of pre-loading everything.
- Subagent delegation: each subagent explores deeply but returns only a condensed summary of 1,000–2,000 tokens.
The goal, as Anthropic’s context engineering guide states: find the smallest set of tokens with the strongest signal that maximizes the probability of achieving your objective.
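Observation masking is the easiest of these to sketch: keep the record that a tool was called, but replace stale outputs with a placeholder. The threshold and placeholder text below are arbitrary choices for the example:

```python
def mask_old_observations(messages: list[dict], keep_last: int = 5) -> list[dict]:
    """Replace old tool outputs with a stub while keeping the call record intact."""
    tool_indices = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    stale = set(tool_indices[:-keep_last]) if keep_last else set(tool_indices)
    masked = []
    for i, m in enumerate(messages):
        if i in stale:
            m = {**m, "content": "[output elided; re-run the tool if needed]"}
        masked.append(m)
    return masked
```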
The glue: prompt construction, output parsing, state
Prompt construction is hierarchical: system prompt at the top, then tool definitions, memory files, conversation history, and the current user message. OpenAI’s Codex uses a strict priority stack where server-controlled system messages override everything below them.
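In code, that hierarchy is just a fixed assembly order. A generic sketch (message shapes are illustrative, not a specific API):

```python
def build_messages(system_prompt: str, tool_schemas: list[dict],
                   memory_index: str, history: list[dict], user_message: str) -> list[dict]:
    """Assemble the prompt top-down: system, tools, memory, history, current request."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": "Available tools:\n" + "\n".join(
            f"- {t['name']}: {t['description']}" for t in tool_schemas)},
        {"role": "system", "content": "Memory index (hints, verify before acting):\n" + memory_index},
        *history,                                     # prior turns, possibly compacted or masked
        {"role": "user", "content": user_message},    # the current request goes last
    ]
```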
Output parsing relies on native tool calling — the model returns structured tool_call objects, not free text to regex apart. The harness checks: is there a tool call? Execute it and loop. No tool call? This is the final answer.
State management varies by framework. LangGraph models state as typed dictionaries flowing through graph nodes with checkpointing at key steps — enabling interrupt-and-resume and even time-travel debugging. Claude Code uses git commits as checkpoints and progress files as structured scratchpads.
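Stripped of any framework, checkpointing can be as small as serializing the loop's state after each step so an interrupted run can resume. The file name and fields here are assumptions:

```python
import json
from pathlib import Path

CHECKPOINT = Path("agent_state.json")   # illustrative location

def save_checkpoint(step: int, messages: list, scratchpad: str) -> None:
    """Persist what resumption needs: turn count, transcript, working notes."""
    CHECKPOINT.write_text(
        json.dumps({"step": step, "messages": messages, "scratchpad": scratchpad}),
        encoding="utf-8",
    )

def load_checkpoint() -> dict | None:
    """Return the last saved state, or None on a fresh run."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text(encoding="utf-8"))
    return None
```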
The safety net: error handling, guardrails, verification
Error handling matters because failures compound. A 10-step process where each step has a 99% success rate only achieves roughly 90% end-to-end reliability (0.99^10 ≈ 0.904). LangGraph classifies errors into four categories: transient (retry with backoff), model-recoverable (return the error as a tool message and let the model adjust), human-fixable (pause and wait), and unexpected (escalate for debugging).
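A sketch of that classification in plain Python, not LangGraph's API; the exception classes are assumptions about how your own tool code reports failures:

```python
import time

# Illustrative error taxonomy; map your tools' exceptions onto these classes.
class TransientError(Exception): pass         # rate limits, timeouts: retry with backoff
class ModelRecoverableError(Exception): pass  # bad arguments, missing file: tell the model
class HumanFixableError(Exception): pass      # missing credentials, ambiguous intent: pause

def execute_with_recovery(fn, *args, retries: int = 3, **kwargs):
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except TransientError:
            time.sleep(2 ** attempt)          # 1s, 2s, 4s between attempts
        except ModelRecoverableError as exc:
            return f"ERROR: {exc}"            # goes back into context as a tool message
        except HumanFixableError:
            raise                             # pause the run and wait for a person
        # anything else is unexpected: let it propagate and escalate for debugging
    return "ERROR: tool still failing after retries"
```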
Guardrails operate in layers: input guardrails check before the first agent runs, output guardrails check the final result, tool guardrails check before every tool call. Tripwire triggers stop the agent immediately.
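A tool guardrail is the simplest layer to show: a check that runs before every call and trips a hard stop on a disallowed action. The deny-list patterns are examples only:

```python
class TripwireTriggered(Exception):
    """Raised to stop the agent immediately; nothing after this runs."""

BLOCKED_PATTERNS = ("rm -rf", "DROP TABLE", "~/.ssh")   # illustrative deny-list

def check_tool_call(name: str, arguments: dict) -> None:
    """Run before every tool execution; raise to halt the agent on a match."""
    rendered = f"{name} {arguments}"
    for pattern in BLOCKED_PATTERNS:
        if pattern in rendered:
            raise TripwireTriggered(f"blocked: {pattern!r} in call to {name}")
```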
Verification loops separate toy demos from production systems. Anthropic recommends three approaches: rule-based feedback (tests, linters), visual feedback (UI screenshots via Playwright), and LLM-as-judge (a subagent that evaluates output). Claude Code’s creator Boris Cherny notes that giving the model the ability to verify its own work improves output quality by 2–3x.
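The rule-based variant fits in a few lines: run the checks after the agent claims it is done, and feed failures back as another turn instead of accepting the answer. The commands below are placeholders for whatever your project uses:

```python
import subprocess

def verify(workdir: str = ".") -> tuple[bool, str]:
    """Rule-based feedback: the system, not the model, decides whether the work passed."""
    checks = [["pytest", "-q"], ["ruff", "check", "."]]   # placeholder commands
    for cmd in checks:
        result = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
        if result.returncode != 0:
            return False, result.stdout + result.stderr   # becomes the agent's next observation
    return True, "all checks passed"
```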
The fleet: subagent orchestration
Claude Code supports three subagent modes: fork (clone the parent's context), teammate (an independent context window communicating through a file-based mailbox), and worktree (an isolated git branch). OpenAI supports agents as tools (experts handle specific subtasks) and handoffs (an expert takes over control for the rest of the run). The pattern is the same: deep exploration happens in subagents, and only summaries come back to the orchestrator.
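Whatever the mode, the orchestrator-side contract is small: hand a subtask and a fresh context to a subagent, ask for a summary within the budget mentioned earlier (1,000–2,000 tokens), and keep only that summary. This sketch reuses the `run_agent` loop from above; the character cap is a simplification:

```python
SUMMARY_BUDGET_CHARS = 6000   # roughly 1,500 tokens; cap on what returns to the orchestrator

def delegate(subtask: str, call_model, tools: dict) -> str:
    """Run a subagent in its own context window; only a condensed summary comes back."""
    prompt = (
        f"{subtask}\n\n"
        "Explore as deeply as you need, but reply with a summary of at most ~1,500 tokens: "
        "findings, decisions, and open questions only."
    )
    result = run_agent(prompt, call_model, tools)      # fresh message list = fresh context
    return result[:SUMMARY_BUDGET_CHARS]               # hard cap as a backstop
```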
Where harness design goes wrong
Building too much harness too early. The scaffolding metaphor isn’t decorative — construction scaffolding is temporary. It lets workers reach heights they couldn’t otherwise, but it comes down when the building is done. As models improve, your harness should get thinner, not thicker. If you add complexity every time you upgrade your model, you’re doing it backward.
Over-engineering the loop. Anthropic’s “dumb loop” philosophy is correct for most use cases. The model is smarter than your hand-crafted routing logic. Give it good tools, good context, and a clear objective, then get out of its way.
Confusing memory with truth. Claude Code’s architecture explicitly treats stored memories as hints requiring verification. Treating them as ground truth leads to agents that confidently act on stale information.
Skipping verification. The jump from demo to production is the jump from “the model says it’s right” to “the system proves it’s right.” If your agent doesn’t verify its outputs, you’re shipping a demo.
How to think about harness design today
Start simple. One agent. One loop. A few well-scoped tools. Get the verification loop right before adding anything else. The TerminalBench evidence is clear: a well-designed thin harness on a good model beats a complex harness on a great model.
The seven decisions every harness architect faces:
- Single agent or multi-agent — exhaust single-agent capability first
- ReAct or plan-then-execute — ReAct is flexible but costly; planning is faster
- Compact or retrieve — summarize conversation or load on demand
- Rule-based or LLM-as-judge — hard tests or another model scoring outputs
- Auto-approve or step-by-step — speed versus safety on tool execution
- Tool scope — expose fewer tools than you have; scope to the current step
- Harness thickness — how much logic lives in code versus in the model
The co-evolution principle: models are now trained with harnesses in mind. If your harness is well-designed, model upgrades improve performance without you adding complexity. The harness gets thinner over time, not thicker.
Two agents running the same model can have wildly different performance. The difference is always the harness. TerminalBench proved it with a 25-spot swing. Your production logs are proving it right now.
Next time your agent underperforms, don’t reach for a better model. Reach for a better harness. Start with the loop. Then memory. Then verification. Ship when the system proves it’s right — not when the model says it is.