SkillOpt: A Text-Space Optimizer for Self-Evolving Agent Skills

Agent skills today suffer from a fundamental problem: they’re either hand-crafted, generated once, or evolved through uncontrolled self-revision. None of these approaches behave like a proper optimizer for the skill itself. SkillOpt changes this by treating the skill document as trainable external state, applying deep-learning-style controls to text-space optimization.

The Core Innovation

SkillOpt introduces the first systematic text-space optimizer for agent skills. Instead of modifying model weights, it optimizes a compact skill document through bounded edits, validation gates, and epoch-wise updates. A separate optimizer model converts scored rollouts into structured add/delete/replace operations on the skill text, accepting changes only when they improve held-out validation performance.

The approach mirrors weight-space optimization: rollout batches control evidence noise, textual learning rates limit edit magnitude, validation gates prevent harmful updates, and rejected-edit buffers provide negative feedback. This creates a controlled training loop for procedural knowledge without touching model parameters.

How It Works

The system maintains three key components:

Target Model: Executes tasks using the current skill document, remaining completely frozen throughout optimization.

Optimizer Model: Analyzes trajectory batches, proposes bounded skill edits, and ranks changes by expected utility.

Validation Gate: Evaluates each candidate skill on held-out data, accepting only strict improvements over the current selection score.

The optimization loop operates in epochs. Each step samples trajectory batches, reflects on failures and successes separately, merges proposals hierarchically, and applies a bounded number of top-ranked edits. Rejected edits become negative feedback for future iterations, while epoch-wise slow updates preserve longer-horizon patterns.

Remarkable Results

SkillOpt dominates across 52 evaluation cells spanning six benchmarks, seven models, and three execution harnesses. On GPT-5.5, it lifts average performance by +23.5 points in direct chat, +24.8 points in Codex harness, and +19.1 points in Claude Code harness.

The gains are largest on procedural benchmarks where reusable rules matter most:

SpreadsheetBench: 41.8 → 80.7 (+38.9 points)
OfficeQA: 33.1 → 72.1 (+39.0 points)
LiveMathematicianBench: 37.6 → 66.9 (+29.3 points)

Even small models benefit dramatically. GPT-5.4-nano nearly doubles on DocVQA and triples on ALFWorld, while Qwen3.5-4B sees a 2.6× improvement on SpreadsheetBench.

Compact, Transferable Artifacts

The learned skills remain remarkably compact—300 to 2,000 tokens after only 1-4 accepted edits. These artifacts transfer effectively across model scales, execution harnesses, and related benchmarks without further optimization.

Cross-model transfer shows skills trained on larger models improve smaller variants. Cross-harness transfer demonstrates a SpreadsheetBench skill trained in Codex transfers to Claude Code with +59.7 point gain. Cross-benchmark transfer validates that math skills generalize to related domains.

What the Skills Actually Learn

The optimized skills encode procedural discipline that frontier models lack zero-shot:

SearchQA: “Choose the shortest canonical entity supported by co-occurring distinctive evidence”
SpreadsheetBench: “Write evaluated static values across the full target range instead of relying on Excel recalculation”
OfficeQA: “Output exactly the requested rounded value without extra labels”
DocVQA: “First bind the question to the exact visual row/header/field, then copy only the aligned answer span”

These rules are procedural rather than instance-specific, addressing systematic failure modes through generalizable constraints.

Why Bounded Updates Matter

The key insight is treating skill editing as optimization rather than rewriting. Unbounded changes can erase useful rules or introduce contradictions. Bounded updates with validation gates ensure each revision stays close enough to the previous version that optimization history remains meaningful.

Ablation studies confirm this design. Removing the textual learning rate drops SpreadsheetBench performance by 1.8 points. Eliminating the rejected-edit buffer costs 2.4-4.6 points. Most dramatically, removing both meta skill and slow update crashes SpreadsheetBench from 77.5 to 55.0 points.

Practical Deployment

SkillOpt’s harness-agnostic design enables deployment across execution environments. The same best_skill.md file works in direct chat, Codex workspaces, and Claude Code environments. Training costs range from 0.6M to 46.4M tokens per test-set point, paid once during optimization with zero inference-time overhead.

The exported skill artifact is auditable, portable, and reusable. Domain practitioners can read, modify, and deploy the learned procedures without model weight updates or specialized infrastructure.

Looking Forward

SkillOpt demonstrates that compact natural-language skills can serve as an effective domain-adaptation layer for frontier agents. By treating the skill document itself as trainable state, it enables systematic improvement through text-space optimization with deep-learning-style controls.

This opens new directions for agent adaptation: skill libraries sharing infrastructure across domains, preference-driven validation for open-ended tasks, and self-distillation of optimized skills back into model weights. The core principle—optimize the skill, not just the prompt—provides a foundation for applying the full optimization toolkit to procedural agent knowledge.

SkillOpt: A Text-Space Optimizer for Self-Evolving Agent Skills

SkillOpt: A Text-Space Optimizer for Self-Evolving Agent Skills

The Core Innovation

How It Works

Remarkable Results

Compact, Transferable Artifacts

What the Skills Actually Learn

Why Bounded Updates Matter

Practical Deployment

Looking Forward

When Bigger Models Perform Worse: Brevity Constraints Reverse Performance Hierarchies in Language Models

Mimosa Framework: Toward Evolving Multi-Agent Systems for Scientific Research

Do Threats and Tips Actually Improve AI Performance? A Rigorous Benchmark Study