Designing Effective Harnesses for LLM Agents

LLM agents increasingly rely on harnesses (scaffolding systems that decompose tasks and guide execution) to handle complex, long-horizon problems. Yet more elaborate harnesses don’t always improve performance. This research examines when harnesses help and when they hinder agent success.

The Harness Design Problem

Harnesses operate on two timescales. The outer timescale sets sub-goals that break tasks into manageable pieces. The inner timescale provides guidance that shapes how agents act within each stage. This separation reveals a fundamental tension: human-designed structure can improve short-term reliability but may limit an agent’s ability to adapt and search effectively.

The key insight is that harness design is an alignment problem. Structure must match the agent’s capabilities and available evidence, not just the task’s logical requirements.

Three Alignment Principles

Granularity-Capability Alignment

Sub-goal granularity must match what agents can reliably accomplish within their retry budget. The research shows that optimal decomposition depends on the relationship between requested progress and the agent’s controllable progress scale.

Too coarse: Sub-goals exceed what agents can reach within available attempts
Too fine: Sub-goals fall below the agent’s natural action scale, accumulating coordination errors

The sweet spot aligns sub-goal size with the agent’s reachable progress scales. This explains why finer workflows aren’t uniformly better—they can fragment tasks into milestones agents cannot meaningfully stop at.
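
The trade-off above can be illustrated with a toy calculation. This is a minimal sketch, not the model used in the research: the exponential decay of single-attempt success, the fixed per-handoff error, and all names and default values (pass_rate, progress_scale, handoff_error, and so on) are assumptions chosen only to show how an interior optimum can arise.

```python
import math

def pass_rate(num_subgoals, task_size=10.0, progress_scale=1.0,
              retries=3, handoff_error=0.05):
    """Toy end-to-end pass rate for a task split into equal sub-goals."""
    subgoal_size = task_size / num_subgoals
    # Assumption: one attempt succeeds less often as the sub-goal
    # grows past the agent's progress scale.
    p_attempt = math.exp(-subgoal_size / progress_scale)
    # A stage passes if any attempt within the retry budget succeeds.
    p_stage = 1.0 - (1.0 - p_attempt) ** retries
    # Assumption: every stage boundary adds a fixed chance of a
    # coordination error, so finer splits pay this cost more often.
    p_stage *= 1.0 - handoff_error
    # All stages must pass for the task to pass.
    return p_stage ** num_subgoals

rates = {k: pass_rate(k) for k in range(1, 101)}
best_k = max(rates, key=rates.get)
print(f"best number of sub-goals: {best_k}")
print(f"pass rate at 1, best, 100: "
      f"{rates[1]:.4f}, {rates[best_k]:.4f}, {rates[100]:.4f}")
```

Under these assumed parameters the pass rate is lowest at the two extremes and highest somewhere in between. Changing progress_scale or retries moves the peak, which is the sense in which granularity has to track capability.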

Guidance-Evidence Alignment

Guidance helps only when it concentrates probability on trajectories that preserve successful continuations. The effect depends on a single quantity, the retention gap: the difference between the weight guidance places on recoverable trajectories and the weight it places on non-recoverable ones.

Positive retention gap: Guidance favors evidence-grounded trajectories, reducing hallucination
Negative retention gap: Guidance favors instruction compliance over evidence, amplifying hallucination

This explains why the same guidance can either reduce or increase hallucination depending on whether it tracks task evidence or just follows instructions.
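
A small numerical sketch can make the sign of the retention gap concrete. It assumes a deliberately simplified setting that is not taken from the research: trajectories fall into two groups, guidance multiplies each group's probability by a fixed weight, and a hypothetical strength exponent scales that reweighting. The function names and the weights are illustrative only.

```python
def recoverable_mass(base_mass, weight_recoverable, weight_other, strength=1.0):
    """Probability mass on recoverable trajectories after guidance reweighting."""
    # Guidance multiplies each group's mass by its weight raised to `strength`.
    kept = base_mass * weight_recoverable ** strength
    lost = (1.0 - base_mass) * weight_other ** strength
    # Renormalize so the two groups sum to one again.
    return kept / (kept + lost)

def retention_gap(weight_recoverable, weight_other):
    """Assumed definition: weight on recoverable minus weight on the rest."""
    return weight_recoverable - weight_other

base = 0.5  # illustrative share of recoverable trajectories without guidance
cases = [("evidence-grounded", 2.0, 1.0), ("compliance-only", 1.0, 2.0)]
for name, w_rec, w_other in cases:
    gap = retention_gap(w_rec, w_other)
    masses = [recoverable_mass(base, w_rec, w_other, s) for s in (0.0, 1.0, 2.0)]
    print(name, f"gap={gap:+.1f}", [round(m, 3) for m in masses])
```

In this toy setting, a positive gap moves mass toward recoverable trajectories as strength grows, and a negative gap moves it away, so the same mechanism helps or hurts depending only on the sign.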

Partial Harnessing

Effective harnesses need not specify the complete execution path. The research introduces partial harnessing: specify initial stages to guide agents into the right search space, then let them plan and adapt autonomously.

The optimal harness length follows a marginal stopping rule: keep adding scaffolded stages only while the next stage saves more tail risk than it introduces. Stronger agents reach this stopping point earlier because they handle longer autonomous tails more reliably.
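
The stopping rule can be sketched in a few lines. The per-stage numbers below are invented for illustration, and the function name harness_length and the two input lists are hypothetical; in practice both quantities would have to be estimated from evaluation runs.

```python
def harness_length(tail_risk_saved, risk_introduced):
    """Number of leading stages to scaffold under a marginal stopping rule.

    tail_risk_saved[i]: failure risk removed from the autonomous tail
                        by also scaffolding stage i.
    risk_introduced[i]: failure risk added by scaffolding stage i.
    """
    length = 0
    for saved, added in zip(tail_risk_saved, risk_introduced):
        # Stop at the first stage that no longer pays for itself.
        if saved <= added:
            break
        length += 1
    return length

# Illustrative numbers only: each scaffolded stage is assumed to add 5% risk.
risk_introduced = [0.05] * 5
weak_agent = [0.30, 0.20, 0.10, 0.04, 0.01]    # gains a lot from early stages
strong_agent = [0.10, 0.04, 0.02, 0.01, 0.00]  # already handles most of the tail

print(harness_length(weak_agent, risk_introduced))    # 3
print(harness_length(strong_agent, risk_introduced))  # 1
```

With these made-up values the stronger agent stops after one scaffolded stage and the weaker one after three, matching the qualitative claim that stronger agents reach the stopping point earlier.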

Experimental Validation

Controlled experiments on synthetic addition tasks and real Terminal-Bench evaluations confirm these principles:

Pass rates peak at intermediate granularities, declining when decomposition becomes too coarse or too fine
Guidance effects depend on alignment: aligned guidance improves performance as strength increases, while misaligned guidance makes things worse
Partial harnesses outperform full specifications when the remaining task falls within the agent’s autonomous capability

A case study on Terminal-Bench shows a partial harness with only three initial steps outperforming a fully specified 10-step workflow. The partial version guides the agent into the right approach, then allows autonomous completion. The full version over-constrains execution, causing the agent to get stuck in prescribed intermediate steps.

Practical Implications

These findings reframe harness engineering from “add more structure” to “choose what to specify”:

Match granularity to capability: Test different decomposition levels to find where sub-goal size aligns with your agent’s progress scale
Align guidance with evidence: Ensure guidance rules favor trajectories supported by available information, not just instruction compliance
Stop scaffolding strategically: Specify enough structure to guide initial direction, then let agents handle the rest autonomously

The right harness is the smallest one that keeps agents on a recoverable trajectory—and no smaller.

Implementation Guidelines

When designing harnesses:

Start with coarse decomposition and refine only if agents consistently fail to make required progress
Design guidance rules that explicitly check evidence rather than just following format requirements
Test partial versions that specify fewer stages than your initial instinct suggests
Monitor whether additional structure improves final success or just creates more coordination overhead
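
One way to put these checks into practice is a small sweep over harness variants. The sketch below is an assumption about how such a sweep might look, not a tool from the research: sweep, smallest_adequate, and the run_trial callback are hypothetical names, and stub_trial is a deterministic placeholder with made-up outcomes standing in for real agent runs.

```python
def sweep(configs, run_trial, trials=20):
    """Pass rate per harness config; run_trial(config, i) returns True on success."""
    rates = {}
    for config in configs:
        passes = sum(1 for i in range(trials) if run_trial(config, i))
        rates[config] = passes / trials
    return rates

def smallest_adequate(rates, tolerance=0.05):
    """Smallest config whose pass rate is within `tolerance` of the best one."""
    best = max(rates.values())
    return min(c for c, r in rates.items() if r >= best - tolerance)

def stub_trial(prefix_stages, trial_index):
    # Placeholder only: fixed, invented pass counts out of 20 per prefix length.
    # Replace with a call that actually runs the agent under that harness.
    passes_out_of_20 = {0: 8, 2: 15, 4: 15, 8: 11}
    return trial_index < passes_out_of_20[prefix_stages]

# Configs here are numbers of scaffolded leading stages.
rates = sweep([0, 2, 4, 8], stub_trial)
print(rates)
print("smallest adequate prefix:", smallest_adequate(rates))
```

Picking the smallest config within a tolerance of the best pass rate, rather than the single best, is a design choice that favors less structure when extra stages do not measurably improve final success.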

The goal is not maximum structure but optimal alignment between imposed constraints and agent capabilities. Sometimes the best harness is the one that knows when to stop helping.

Harnesses for Inference-Time Alignment over Execution Trajectories