Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

LLM agents have begun discovering real security vulnerabilities that human auditors and automated fuzzers missed for decades. However, most systems rely on hand-designed harnesses—the orchestration code that coordinates multiple agents. AgentFlow addresses this limitation by automatically synthesizing multi-agent harnesses using a typed graph DSL and feedback-driven optimization.

The Problem with Hand-Designed Harnesses

Single-agent vulnerability discovery systems break down for three reasons:

  • Context overflow: Real targets produce megabytes of output that exceed context windows
  • Lost-in-the-middle effects: Models drop earlier analysis when handling heterogeneous tasks
  • Sequential bottlenecks: Single traces cannot explore multiple hypotheses in parallel

Multi-agent systems solve these problems by splitting work across specialized agents, but someone must design the harness—the program that specifies which agents exist, how they communicate, and when they retry. On TerminalBench-2, three systems using the same Claude Opus 4.6 model achieve pass rates spanning 20% to 80%, with harness design accounting for the entire difference.

AgentFlow’s Approach

AgentFlow introduces two key innovations to automate harness synthesis:

Typed Graph DSL

AgentFlow represents every harness as a program in a typed domain-specific language where:

  • Nodes are agents with roles, prompts, models, and tools
  • Edges are dataflow or retry links between agents
  • Templates determine which feedback channels each agent reads

The DSL unifies all harness dimensions into a single searchable representation:

  • Agent roles (𝒜)
  • Communication topology (𝒢)
  • Message schemas (Σ)
  • Tool bindings (Φ)
  • Coordination protocol (Ψ)

A type system ensures well-formedness before expensive LLM evaluation, rejecting structurally broken candidates in linear time.
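The shape of such a DSL can be sketched in Python. Everything below is illustrative: the class names, fields, and the connectivity rule are assumptions for the sketch, not AgentFlow's actual grammar.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Agent:
    name: str
    role: str          # agent roles (A), e.g. "triage", "exploit"
    prompt: str
    model: str
    tools: tuple = ()  # tool bindings (Phi)

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    kind: str          # "dataflow" or "retry"

@dataclass
class Harness:
    agents: dict       # name -> Agent
    edges: list        # communication topology (G)

    def well_formed(self) -> bool:
        """Linear-time structural check: every edge endpoint must name a
        declared agent, every edge kind must be known, and every agent
        must appear in the graph. Stands in for the paper's type system."""
        names = set(self.agents)
        for e in self.edges:
            if e.src not in names or e.dst not in names:
                return False
            if e.kind not in ("dataflow", "retry"):
                return False
        touched = {e.src for e in self.edges} | {e.dst for e in self.edges}
        return len(names) <= 1 or touched == names

h = Harness(
    agents={
        "triage": Agent("triage", "triage", "Rank crash reports.", "model-a"),
        "poc": Agent("poc", "exploit", "Write a proof of concept.", "model-a"),
    },
    edges=[Edge("triage", "poc", "dataflow"), Edge("poc", "triage", "retry")],
)
print(h.well_formed())  # True
```

Because the check inspects only declared names and edge endpoints, it runs in time linear in the harness size, matching the claim above.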

Feedback-Driven Optimization

Instead of binary pass/fail signals, AgentFlow reads structured runtime feedback from target programs:

  • Test verdicts: Pass/fail outcomes
  • Program output: Stdout/stderr messages
  • Coverage data: Which code lines executed
  • Sanitizer reports: Memory safety violations

This rich feedback enables precise failure diagnosis. Coverage reveals whether vulnerable code was reached; sanitizer output distinguishes real crashes from false positives.
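As a concrete illustration, the four feedback channels above might be bundled into a record and mapped to a coarse failure label. The field names and the diagnosis rules here are assumptions for the sketch, not AgentFlow's API.

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    passed: bool            # test verdict
    stderr: str             # program output
    covered_lines: set      # coverage data
    sanitizer_report: str   # sanitizer output, "" if clean

def diagnose(target_lines: set, fb: Feedback) -> str:
    """Turn raw runtime feedback into a failure label that a proposer
    can act on, rather than a bare pass/fail bit."""
    if fb.sanitizer_report:
        # Sanitizer output distinguishes real crashes from false positives.
        return "memory-safety violation (real crash)"
    if not target_lines & fb.covered_lines:
        # Coverage reveals whether the vulnerable code was reached at all.
        return "vulnerable code never reached"
    if not fb.passed:
        return "reached target but no crash: refine exploit"
    return "pass"

fb = Feedback(passed=False, stderr="timeout",
              covered_lines={10, 11}, sanitizer_report="")
print(diagnose({42, 43}, fb))  # vulnerable code never reached
```

The point of the sketch is the contrast with a binary signal: the same failing run yields three distinguishable diagnoses depending on which channel fired.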

The Optimization Loop

AgentFlow follows a four-phase cycle:

  1. Propose: Generate new harness based on diagnosis and archive
  2. Execute & Observe: Run harness on tasks, collect feedback
  3. Score: Evaluate performance (pass rate or unique crashes)
  4. Diagnose: Identify failure causes and suggest fixes

Each iteration can simultaneously add agents, rewire communication graphs, update prompts, and change coordination protocols—all as local rewrites of the same DSL program.
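The cycle above can be sketched as a generic loop over an archive of scored harnesses. The helper callables stand in for LLM proposal calls and real task execution; every name here is hypothetical.

```python
def optimize(initial_harness, tasks, steps, propose, execute, score, diagnose):
    """Propose/Execute/Score/Diagnose loop. Keeps an archive of scored
    harnesses and always proposes a rewrite of the best one seen so far,
    guided by the latest diagnosis."""
    archive = [(score(execute(initial_harness, tasks)), initial_harness)]
    diagnosis = None
    for _ in range(steps):
        _, best = max(archive, key=lambda p: p[0])
        candidate = propose(best, diagnosis, archive)   # 1. Propose
        feedback = execute(candidate, tasks)            # 2. Execute & Observe
        s = score(feedback)                             # 3. Score
        diagnosis = diagnose(feedback)                  # 4. Diagnose
        archive.append((s, candidate))
    return max(archive, key=lambda p: p[0])[1]

# Toy instantiation: a "harness" is just a pass rate we try to climb,
# echoing the 35.2% -> ~85% trajectory described in the Results section.
result = optimize(
    initial_harness=0.35,
    tasks=None,
    steps=10,
    propose=lambda best, diag, arc: min(best + 0.05, 0.85),
    execute=lambda h, tasks: h,
    score=lambda fb: fb,
    diagnose=lambda fb: "below target" if fb < 0.85 else "ok",
)
print(round(result, 2))  # 0.85
```

In the real system each `propose` step would emit a full DSL program, so adding agents, rewiring edges, and editing prompts are all just different rewrites of `best`.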

Results

TerminalBench-2 Performance

AgentFlow achieved 84.3% on TerminalBench-2 with Claude Opus 4.6, the highest score among publicly ranked harnesses. The synthesis trajectory climbed from 35.2% to 84.3% through three phases:

  • Infrastructure (Steps 1-5): Tool bindings and coordination fixes (+28.8pp)
  • Specialization (Steps 6-9): Specialist agents and retry logic (+15.8pp)
  • Ensemble (Step 12): Fan-out/merge topology (+4.5pp)

Real-World Impact

The same synthesis loop discovered ten zero-day vulnerabilities in Google Chrome using Kimi K2.5, including two Critical sandbox-escape CVEs (CVE-2026-5280 and CVE-2026-6297). All findings were confirmed through Chrome’s Vulnerability Reward Program.

Ablation Study

Disabling different edit types showed:

  • No prompt search: -32.5 percentage points
  • No structural search: -7.9 percentage points
  • No tool search: -12.4 percentage points

This demonstrates that prompt optimization provides the largest single gain, while tool and structural edits contribute further improvements.

Key Advantages

AgentFlow addresses fundamental limitations of prior harness optimizers:

Broader search space: Previous systems search only narrow slices (prompts only, topology only, agents only). AgentFlow searches all components simultaneously.

Richer feedback: Binary pass/fail signals provide no diagnostic information. Runtime feedback channels localize specific failure modes.

Unified representation: The typed DSL makes cross-component edits expressible as local program rewrites rather than requiring separate optimization procedures.

Implementation

The system validates proposed harnesses through three stages:

  1. Syntactic parsing of DSL code
  2. Well-formedness check (template resolution, graph connectivity)
  3. Smoke test on single task

Approximately 20% of proposals fail validation and are rejected before expensive LLM evaluation, significantly reducing optimization costs.
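One way to sketch this three-stage gate is as a short-circuiting pipeline that reports which stage rejected a proposal; the callables are placeholders for the real parser, well-formedness checker, and smoke-test runner, and all names are assumptions.

```python
def validate(dsl_source, parse, check_well_formed, smoke_test):
    """Run the three validation stages in order and return (ok, stage),
    so rejected proposals carry the name of the stage that failed."""
    try:
        program = parse(dsl_source)          # 1. syntactic parsing
    except SyntaxError:
        return False, "parse"
    if not check_well_formed(program):       # 2. well-formedness check
        return False, "well-formedness"
    if not smoke_test(program):              # 3. smoke test on one task
        return False, "smoke-test"
    return True, "ok"

# Toy usage with trivial stand-in stages.
ok, stage = validate(
    "agent triage -> poc",
    parse=lambda s: s.split(),
    check_well_formed=lambda p: "->" in p,
    smoke_test=lambda p: True,
)
print(ok, stage)  # True ok
```

Ordering the stages from cheapest to most expensive is what makes the early rejection pay off: a parse or type failure never reaches the smoke test, let alone full LLM evaluation.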

Conclusion

AgentFlow demonstrates that automated harness synthesis can match or exceed hand-designed systems while discovering real vulnerabilities in production codebases. The typed DSL and feedback-driven optimization provide a principled approach to multi-agent system design that generalizes across domains, models, and task types.

The framework’s success on both benchmark tasks and real-world vulnerability discovery suggests that automated agent orchestration represents a promising direction for scaling LLM-based security analysis to increasingly complex targets.