Natural-Language Agent Harnesses: Making AI Agent Control Logic Portable and Executable
Modern AI agents succeed or fail based on their harness: the control stack that orchestrates multi-step reasoning, tool use, memory, and delegation. Yet this critical component usually remains buried in framework code, which makes it hard to compare, transfer, or study systematically.
Researchers from Tsinghua University propose a solution: Natural-Language Agent Harnesses (NLAHs) that express control logic in editable text, executed by an Intelligent Harness Runtime (IHR) that interprets these harnesses directly.
The Hidden Problem in Agent Systems
Agent performance increasingly depends on harness engineering—the orchestration layer governing multiple model calls, tool interactions, and state management. Current harnesses scatter their logic across controller code, framework defaults, and runtime assumptions, creating three major problems:
- Transfer difficulty: Harnesses can’t move between different runtime environments
- Comparison challenges: Two systems that appear to differ by one design choice actually differ in prompts, tool mediation, and state semantics simultaneously
- Study limitations: Harness logic remains opaque, preventing systematic ablation and improvement
This forces researchers to compare entire controller bundles rather than isolating specific harness patterns.
Natural-Language Harnesses as Executable Objects
NLAHs externalize harness control logic as structured natural-language representations with five core components:
- Contracts: Required inputs/outputs, validation gates, permission boundaries, and stopping rules
- Roles: Non-overlapping responsibilities for solver, verifier, researcher, and orchestrator agents
- Stage Structure: Explicit workflow topology (plan → execute → verify → repair)
- Adapters and Scripts: Named hooks for deterministic actions like tests, verifiers, and parsing
- State Semantics: What persists across steps and how it is reopened later through paths and manifests
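To make the five components concrete, here is a minimal Python sketch of how a harness could be held in memory and checked for internal consistency. It is an illustration only: the class names, field names, and the example harness contents are all assumptions, not the format used by the researchers.

```python
from dataclasses import dataclass, field


@dataclass
class Contract:
    # Hypothetical fields mirroring the "Contracts" component.
    required_inputs: list[str]
    required_outputs: list[str]
    stopping_rule: str  # natural-language text, left for the runtime to interpret


@dataclass
class Stage:
    name: str
    role: str         # which role owns this stage
    instruction: str  # natural-language control logic for the stage
    adapters: list[str] = field(default_factory=list)  # named deterministic hooks


@dataclass
class Harness:
    contract: Contract
    roles: dict[str, str]        # role name -> responsibility text
    stages: list[Stage]          # ordered workflow topology
    adapters: set[str]           # names of hooks the runtime provides
    persistent_paths: list[str]  # state semantics: files that survive across steps

    def problems(self) -> list[str]:
        """Return structural problems; an empty list means every reference resolves."""
        found = []
        for stage in self.stages:
            if stage.role not in self.roles:
                found.append(f"stage {stage.name!r} uses unknown role {stage.role!r}")
            for hook in stage.adapters:
                if hook not in self.adapters:
                    found.append(f"stage {stage.name!r} uses unknown adapter {hook!r}")
        return found


# Hypothetical example harness following plan -> execute -> verify -> repair.
harness = Harness(
    contract=Contract(["task.md"], ["patch.diff"], "Stop when tests pass or after 3 repairs."),
    roles={"solver": "writes the patch", "verifier": "checks the patch"},
    stages=[
        Stage("plan", "solver", "Read task.md and write plan.md."),
        Stage("execute", "solver", "Apply the plan and produce patch.diff."),
        Stage("verify", "verifier", "Run the tests and record the result.", ["run_tests"]),
        Stage("repair", "solver", "Fix failures reported by the verifier."),
    ],
    adapters={"run_tests"},
    persistent_paths=["plan.md", "patch.diff"],
)
print(harness.problems())  # prints []
```

The check only covers cross-references between components; the instructions themselves stay as free text, which is the part an LLM-backed runtime would interpret.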
The Intelligent Harness Runtime places an LLM inside the execution loop to interpret these natural-language specifications while providing tool access and multi-agent coordination.
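The idea of a model sitting inside the execution loop can be sketched in a few lines of Python. This is a simplified stand-in, not the IHR itself: the stage order is hard-coded here, whereas the runtime described above lets the model interpret the topology from the harness text. The `LLM` type, `run_harness`, the stub model, and the "FAIL" verdict convention are all hypothetical.

```python
from typing import Callable

# Hypothetical stand-in for a model call: takes a prompt, returns text.
LLM = Callable[[str], str]

# Hypothetical natural-language instructions, one per stage.
INSTRUCTIONS = {
    "plan": "Outline the steps needed to solve the task.",
    "execute": "Carry out the plan and produce the artifact.",
    "verify": "Check the artifact. Answer PASS or FAIL with a reason.",
    "repair": "Fix the problems named in the last verdict.",
}


def run_harness(instructions: dict[str, str], llm: LLM, max_repairs: int = 2) -> list[str]:
    """Run plan -> execute -> verify, repairing while the verdict starts with FAIL."""
    trace: list[str] = []
    state = ""  # text carried from one stage to the next

    def step(name: str) -> str:
        nonlocal state
        prompt = f"Stage: {name}\nInstruction: {instructions[name]}\nState so far: {state}"
        state = llm(prompt)
        trace.append(name)
        return state

    step("plan")
    step("execute")
    verdict = step("verify")
    repairs = 0
    while verdict.startswith("FAIL") and repairs < max_repairs:
        step("repair")
        verdict = step("verify")
        repairs += 1
    return trace


def make_stub_llm() -> LLM:
    """Stub model that fails the first verification and passes the second."""
    verify_calls = 0

    def stub(prompt: str) -> str:
        nonlocal verify_calls
        if prompt.startswith("Stage: verify"):
            verify_calls += 1
            return "FAIL: tests red" if verify_calls == 1 else "PASS"
        return "ok"

    return stub


trace = run_harness(INSTRUCTIONS, make_stub_llm())
print(trace)  # prints ['plan', 'execute', 'verify', 'repair', 'verify']
```

Swapping the stub for a real model client would leave the loop unchanged, which is one way to read the separation between harness text and runtime.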
Controlled Experimental Evidence
The researchers evaluated their approach across coding (SWE-bench Verified) and computer-use (OSWorld) benchmarks, examining three key questions:
Behavioral Impact
Full IHR systems show markedly different process signatures from ablated versions. On SWE-bench, TRAE harnesses under full IHR consumed 16.3M prompt tokens versus 1.2M without harness logic, with 90% of usage occurring in delegated child agents rather than the parent thread. This points to genuine multi-stage orchestration rather than simple prompt decoration.
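The token figures above are easier to weigh as a ratio and a split. The short calculation below uses only the numbers quoted in the text; it assumes the 90% share applies to the 16.3M prompt tokens, which the summary does not state explicitly.

```python
full_ihr_tokens = 16.3e6    # prompt tokens under full IHR (from the text)
no_harness_tokens = 1.2e6   # prompt tokens without harness logic (from the text)
child_share = 0.90          # assumption: this share applies to prompt tokens

ratio = full_ihr_tokens / no_harness_tokens
child_tokens = full_ihr_tokens * child_share
parent_tokens = full_ihr_tokens - child_tokens

print(f"full IHR uses {ratio:.1f}x the prompt tokens")   # prints 13.6x
print(f"child agents: {child_tokens / 1e6:.2f}M tokens")  # prints 14.67M
print(f"parent thread: {parent_tokens / 1e6:.2f}M tokens")  # prints 1.63M
```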
Module Composition
Individual harness modules can be composed and ablated systematically. Self-evolution modules improve solve loops by enforcing disciplined acceptance gates. File-backed state modules improve process structure and auditability. However, more structure doesn’t automatically mean better performance: verifier and multi-candidate search modules can reshape success signals in ways that diverge from benchmark acceptance.
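Systematic ablation of this kind amounts to enumerating module configurations and running each one. The sketch below generates leave-one-out and full-subset configurations for the four modules named above; the identifiers and helper functions are hypothetical, and the evaluation step itself is left out.

```python
from itertools import combinations

# Module names come from the paragraph above; the identifiers are hypothetical.
MODULES = ["self_evolution", "file_backed_state", "verifier", "multi_candidate_search"]


def leave_one_out(modules: list[str]) -> dict[str, list[str]]:
    """Map each removed module to the configuration that remains without it."""
    return {removed: [m for m in modules if m != removed] for removed in modules}


def all_subsets(modules: list[str]) -> list[list[str]]:
    """Every combination of modules, from the bare baseline to the full harness."""
    return [list(c) for r in range(len(modules) + 1) for c in combinations(modules, r)]


for removed, config in leave_one_out(MODULES).items():
    print(f"without {removed}: {config}")

print(len(all_subsets(MODULES)))  # prints 16
```

Leave-one-out needs one run per module, while the full subset grid doubles with every module added, so the cheaper design is usually the practical starting point.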
Code-to-Text Migration
Migrating OS-Symphony from native code to an NLAH representation raised task success from 30.4% to 47.2%. The migration shifted control from screenshot-grounded repair loops to file-backed state and artifact-backed verification, which suggests that natural-language harnesses can preserve functional behavior while relocating reliability mechanisms.
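For scale, the two success rates can be expressed as an absolute and a relative gain. The calculation uses only the figures quoted above.

```python
native_success = 30.4  # percent task success, native code (from the text)
nlah_success = 47.2    # percent task success, NLAH representation (from the text)

absolute_gain = nlah_success - native_success
relative_gain = absolute_gain / native_success

print(f"absolute gain: {absolute_gain:.1f} percentage points")  # prints 16.8
print(f"relative gain: {relative_gain:.1%}")                    # prints 55.3%
```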
Practical Implications
This work enables several advances in agent development:
- Harness Portability: Control logic becomes transferable between runtime environments
- Systematic Comparison: Researchers can isolate harness pattern effects from implementation details
- Module Reuse: Proven harness components can be shared and recombined across projects
- Scientific Study: Harness engineering becomes a controlled, ablatable research object
Implementation Considerations
Natural language carries editable orchestration logic while deterministic code handles low-level operations. The runtime charter defines shared semantics for contracts, state management, and child lifecycle, enabling different harness skills to execute under common assumptions.
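One way to picture this split is a registry of named deterministic hooks that harness text can refer to by name. The sketch below is an assumption about how such a registry might look, not the runtime's actual interface: `ADAPTERS`, the `adapter` decorator, `call_adapter`, and the report format are all hypothetical.

```python
from typing import Callable

# Hypothetical registry: hook name -> deterministic function.
ADAPTERS: dict[str, Callable[[str], dict]] = {}


def adapter(name: str):
    """Decorator that registers a deterministic function under a stable name."""
    def register(fn: Callable[[str], dict]) -> Callable[[str], dict]:
        ADAPTERS[name] = fn
        return fn
    return register


@adapter("parse_test_report")
def parse_test_report(text: str) -> dict:
    """Count PASS/FAIL lines in a plain-text report (hypothetical format)."""
    lines = text.splitlines()
    passed = sum(1 for line in lines if line.startswith("PASS"))
    failed = sum(1 for line in lines if line.startswith("FAIL"))
    return {"passed": passed, "failed": failed, "accepted": failed == 0 and passed > 0}


def call_adapter(name: str, payload: str) -> dict:
    """Dispatch by name so harness text never depends on Python identifiers."""
    if name not in ADAPTERS:
        raise KeyError(f"unknown adapter: {name}")
    return ADAPTERS[name](payload)


result = call_adapter("parse_test_report", "PASS test_a\nFAIL test_b\nPASS test_c")
print(result)  # prints {'passed': 2, 'failed': 1, 'accepted': False}
```

A harness stage could then say "run parse_test_report on the test output and accept only if it reports accepted", leaving the judgment about what to do next in natural language and the counting in code.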
The approach works best when harness modules tighten the path from intermediate behavior to final acceptance, and less well when they add process layers whose success criteria diverge from benchmark goals.
Looking Forward
Once harnesses become explicit objects, they open new possibilities for automated search and optimization over harness representations. This could transform harness engineering from opaque bundle development into systematic pattern discovery and reuse.
The research demonstrates that agent control logic can be externalized as readable, executable artifacts under shared runtime semantics—a crucial step toward making agent harness engineering a more scientific and collaborative discipline.