Natural-Language Agent Harnesses: Making AI Agent Control Logic Portable and Executable

Modern AI agents succeed or fail based on their harness, the control stack that orchestrates multi-step reasoning, tool use, memory, and delegation. Yet this critical component usually remains buried in framework code, which makes it hard to compare, transfer, or study systematically.

Researchers from Tsinghua University propose a solution: Natural-Language Agent Harnesses (NLAHs) that express control logic in editable text, executed by an Intelligent Harness Runtime (IHR) that interprets these harnesses directly.

The Hidden Problem in Agent Systems

Agent performance increasingly depends on harness engineering—the orchestration layer governing multiple model calls, tool interactions, and state management. Current harnesses scatter their logic across controller code, framework defaults, and runtime assumptions, creating three major problems:

Transfer difficulty: Harnesses can’t move between different runtime environments
Comparison challenges: Two systems that appear to differ by one design choice actually differ in prompts, tool mediation, and state semantics simultaneously
Study limitations: Harness logic remains opaque, preventing systematic ablation and improvement

This forces researchers to compare entire controller bundles rather than isolating specific harness patterns.

Natural-Language Harnesses as Executable Objects

NLAHs externalize harness control logic as structured natural-language representations with five core components:

Contracts: Required inputs/outputs, validation gates, permission boundaries, and stopping rules

Roles: Non-overlapping responsibilities for solver, verifier, researcher, and orchestrator agents

Stage Structure: Explicit workflow topology (plan → execute → verify → repair)

Adapters and Scripts: Named hooks for deterministic actions like tests, verifiers, and parsing

State Semantics: What persists across steps and how later steps reopen it through file paths and manifests
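
To make the five components concrete, here is a minimal Python sketch of how such a harness might be held as data and flattened into editable text. The `Harness` class, its field names, the `render` layout, and every example value are assumptions made for illustration; the source does not describe the actual NLAH file format.

```python
from dataclasses import dataclass


@dataclass
class Harness:
    """Hypothetical container mirroring the five NLAH components."""
    contracts: str
    roles: dict[str, str]
    stages: list[str]
    adapters: dict[str, str]
    state_semantics: str

    def render(self) -> str:
        """Flatten the harness into the plain text a runtime would read."""
        lines = ["CONTRACTS", self.contracts, "", "ROLES"]
        lines += [f"- {name}: {duty}" for name, duty in self.roles.items()]
        lines += ["", "STAGES", " -> ".join(self.stages), "", "ADAPTERS"]
        lines += [f"- {hook}: {command}" for hook, command in self.adapters.items()]
        lines += ["", "STATE", self.state_semantics]
        return "\n".join(lines)


# All values below are invented examples, not taken from the paper.
example = Harness(
    contracts="Input: issue text. Output: a patch. Stop when tests pass or after 3 repairs.",
    roles={"solver": "writes the patch", "verifier": "checks the patch against tests"},
    stages=["plan", "execute", "verify", "repair"],
    adapters={"run_tests": "pytest -q"},
    state_semantics="Notes persist in state/notes.md and are reopened by path.",
)
print(example.render())
```

Because the rendered result is ordinary text, changing a stopping rule or a role means editing a sentence rather than controller code.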

The Intelligent Harness Runtime places an LLM inside the execution loop to interpret these natural-language specifications while providing tool access and multi-agent coordination.
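
The idea of an interpreter inside the loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not the IHR itself: `scripted_interpreter` is a hard-coded stand-in for the model call, and `run_harness`, the `tools` table, and the state keys are hypothetical names.

```python
from typing import Callable

# The interpreter stands in for the LLM: given the harness text and the
# current state, it names the next stage to run, or "stop".
Interpreter = Callable[[str, dict], str]


def scripted_interpreter(harness_text: str, state: dict) -> str:
    """Toy replacement for a model call: walk the stages, stop once verified."""
    order = ["plan", "execute", "verify"]
    if state.get("verified"):
        return "stop"
    return order[len(state["history"]) % len(order)]


def run_harness(harness_text: str, interpret: Interpreter, tools: dict, max_steps: int = 10) -> dict:
    """Ask the interpreter for the next stage and dispatch it until it stops."""
    state = {"history": []}
    for _ in range(max_steps):
        stage = interpret(harness_text, state)
        if stage == "stop":
            break
        tools[stage](state)           # deterministic tool call for this stage
        state["history"].append(stage)
    return state


# Hypothetical tools that only record what a real stage would produce.
tools = {
    "plan": lambda s: s.update(plan="edit the parser"),
    "execute": lambda s: s.update(patch="patch text"),
    "verify": lambda s: s.update(verified=True),
}
final = run_harness("plan -> execute -> verify", scripted_interpreter, tools)
print(final["history"])  # ['plan', 'execute', 'verify']
```

In a real runtime the interpreter would be a model reading the harness text, so the control flow could change by editing that text while the loop and the tools stay fixed.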

Controlled Experimental Evidence

The researchers evaluated their approach across coding (SWE-bench Verified) and computer-use (OSWorld) benchmarks, examining three key questions:

Behavioral Impact

Full IHR systems show dramatically different process signatures than ablated versions. On SWE-bench, TRAE harnesses under full IHR consumed 16.3M prompt tokens versus 1.2M without harness logic, with 90% of usage occurring in delegated child agents rather than the parent thread. This demonstrates genuine multi-stage orchestration rather than simple prompt decoration.
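
For a sense of scale, the reported token counts can be turned into a ratio with a few lines of Python. The child-agent figure below assumes the 90% share applies to the 16.3M prompt tokens, which the summary does not state explicitly.

```python
full_tokens_m = 16.3      # prompt tokens under full IHR, in millions (from the text)
ablated_tokens_m = 1.2    # prompt tokens without harness logic, in millions (from the text)
child_share = 0.90        # share attributed to delegated child agents (from the text)

ratio = full_tokens_m / ablated_tokens_m
# Assumption: the 90% share is measured over the same prompt-token total.
child_tokens_m = full_tokens_m * child_share

print(f"full / ablated: {ratio:.1f}x")
print(f"child agents under that assumption: {child_tokens_m:.2f}M tokens")
```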

Module Composition

Individual harness modules can be composed and ablated systematically. Self-evolution modules improve solve loops by enforcing disciplined acceptance gates. File-backed state modules improve process structure and auditability. However, more structure doesn’t automatically mean better performance—verifier and multi-candidate search modules can reshape success signals in ways that diverge from benchmark acceptance.
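
One way to picture systematic ablation is to treat each module as a text fragment that is either included in the harness or left out. The sketch below is hypothetical: the module names come from the paragraph above, but the fragment wording and the `compose` and `leave_one_out` helpers are invented for illustration.

```python
MODULES = ["self_evolution", "file_backed_state", "verifier", "multi_candidate_search"]

# Hypothetical one-line fragments; real modules would be longer prose.
FRAGMENTS = {
    "self_evolution": "Accept a revised attempt only if it passes the acceptance gate.",
    "file_backed_state": "Write intermediate results to files and reopen them by path.",
    "verifier": "Have a separate verifier role check each candidate.",
    "multi_candidate_search": "Generate several candidates before choosing one.",
}


def compose(base: str, enabled: list[str]) -> str:
    """Join the base harness text with the fragments of the enabled modules."""
    return "\n".join([base] + [FRAGMENTS[name] for name in enabled])


def leave_one_out(modules: list[str]):
    """Yield (removed, remaining) pairs, one per single-module ablation."""
    for removed in modules:
        yield removed, [m for m in modules if m != removed]


for removed, remaining in leave_one_out(MODULES):
    text = compose("Base harness: plan, execute, verify, repair.", remaining)
    print(f"without {removed}: {len(remaining)} modules, {len(text.splitlines())} lines")
```

Each composed text would then be run on the same tasks, so any difference in outcome can be attributed to the one module that was removed.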

Code-to-Text Migration

Migrating OS-Symphony from native code to an NLAH representation raised task success from 30.4% to 47.2%. The migration shifted control from screenshot-grounded repair loops to file-backed state and artifact-backed verification, which suggests that natural-language harnesses can preserve functional behavior while relocating reliability mechanisms.
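
The size of that change is easier to read as both an absolute and a relative gain. The short calculation below uses only the two success rates quoted above.

```python
native_success = 30.4   # percent task success with the native-code harness (from the text)
nlah_success = 47.2     # percent task success after migration to NLAH (from the text)

absolute_gain = nlah_success - native_success        # in percentage points
relative_gain = absolute_gain / native_success * 100  # in percent of the baseline

print(f"absolute gain: {absolute_gain:.1f} percentage points")
print(f"relative gain: {relative_gain:.1f}%")
```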

Practical Implications

This work enables several advances in agent development:

Harness Portability: Control logic becomes transferable between runtime environments

Systematic Comparison: Researchers can isolate harness pattern effects from implementation details

Module Reuse: Proven harness components can be shared and recombined across projects

Scientific Study: Harness engineering becomes a controlled, ablatable research object

Implementation Considerations

Natural language carries the editable orchestration logic, while deterministic code handles low-level operations. The runtime charter defines shared semantics for contracts, state management, and the child-agent lifecycle, so that different harness skills can execute under common assumptions.
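
The split between editable text and deterministic code can be illustrated with a small registry of named hooks. This is a sketch under assumptions: the `adapter` decorator, `call_hook`, the hook names, and the `score=` output format are all hypothetical and are not taken from the paper or its runtime.

```python
import re

# Registry mapping hook names (as a harness text would reference them)
# to deterministic Python functions.
ADAPTERS = {}


def adapter(name: str):
    """Decorator that registers a function under a hook name."""
    def register(fn):
        ADAPTERS[name] = fn
        return fn
    return register


@adapter("parse_score")
def parse_score(text: str) -> int:
    """Extract an integer score from tool output of the form 'score=NN'."""
    match = re.search(r"score=(\d+)", text)
    if match is None:
        raise ValueError("no score found in output")
    return int(match.group(1))


@adapter("accept_if_passing")
def accept_if_passing(score: int, threshold: int = 80) -> bool:
    """Deterministic acceptance gate: pass only at or above the threshold."""
    return score >= threshold


def call_hook(name: str, *args, **kwargs):
    """Dispatch a hook by name, failing loudly on names the runtime lacks."""
    if name not in ADAPTERS:
        raise KeyError(f"harness referenced unknown adapter: {name}")
    return ADAPTERS[name](*args, **kwargs)


score = call_hook("parse_score", "verifier output: score=92")
print(call_hook("accept_if_passing", score))  # True
```

Keeping parsing and gating in code like this means the model decides when to invoke a hook, while the hook's result stays reproducible.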

The approach works best when harness modules tighten the path from intermediate behavior to final acceptance, and less well when they add process layers whose success criteria diverge from benchmark goals.

Looking Forward

Once harnesses become explicit objects, they open new possibilities for automated search and optimization over harness representations. This could transform harness engineering from opaque bundle development into systematic pattern discovery and reuse.

The research demonstrates that agent control logic can be externalized as readable, executable artifacts under shared runtime semantics—a crucial step toward making agent harness engineering a more scientific and collaborative discipline.
