How We Built Our Multi-Agent Research System
Anthropic’s Research feature uses multiple Claude agents working in parallel to tackle complex research tasks more effectively than single-agent approaches. Here’s what we learned building this system from prototype to production.
Why Multi-Agent Systems Excel at Research
Research involves unpredictable, path-dependent exploration. You can’t hardcode fixed steps for investigating complex topics—the process adapts based on discoveries along the way.
Multi-agent systems solve this through parallel exploration. While a single agent searches sequentially, our system spawns specialized subagents that investigate different aspects simultaneously. Each subagent operates in its own context window, providing separation of concerns and reducing path dependency.
Our internal evaluations show dramatic improvements: multi-agent Claude Opus 4 with Sonnet 4 subagents outperformed single-agent Opus 4 by 90.2% on research tasks. The system excels at breadth-first queries requiring multiple independent directions.
The key insight: token usage explains 80% of performance variance in browsing evaluations. Multi-agent architectures effectively scale token usage by distributing work across separate context windows.
Architecture: Orchestrator-Worker Pattern
Our system uses a lead agent that coordinates specialized subagents operating in parallel.
When you submit a query, the lead agent:
- Analyzes the request and develops a strategy
- Spawns subagents to explore different aspects simultaneously
- Synthesizes results and determines if more research is needed
- Passes findings to a citation agent for proper attribution
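The loop above can be sketched in code. This is a runnable toy with stubbed-out model calls; function names like `plan_subtasks` and `run_subagent` are illustrative, not the production interface — in the real system each stub is a Claude invocation.

```python
from dataclasses import dataclass

@dataclass
class SubagentResult:
    topic: str
    findings: list

# Stub "LLM" calls so the control flow is runnable end to end.
def plan_subtasks(query, prior):
    # Lead agent decomposes the query; returns nothing once covered.
    return [] if prior else [f"{query}: background", f"{query}: recent developments"]

def run_subagent(task):
    return SubagentResult(topic=task, findings=[f"notes on {task}"])

def needs_more_research(query, findings):
    return False  # stub: stop after the first round

def synthesize(query, findings):
    return " | ".join(f for r in findings for f in r.findings)

def research(query, max_rounds=3):
    findings = []
    for _ in range(max_rounds):
        subtasks = plan_subtasks(query, findings)
        if not subtasks:
            break
        for task in subtasks:  # the production system runs these in parallel
            findings.append(run_subagent(task))
        if not needs_more_research(query, findings):
            break
    return synthesize(query, findings)
```

The essential shape is plan, fan out, check coverage, synthesize — the citation pass would consume the synthesized report plus the raw findings.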
This differs from traditional RAG approaches that use static retrieval. Our architecture performs multi-step search that dynamically finds information, adapts to findings, and analyzes results.
Prompt Engineering Lessons
Multi-agent systems introduce coordination complexity that single-agent prompting doesn’t address. Here’s what worked:
Think like your agents. Build simulations using the exact prompts and tools from your system. Watch agents work step by step to understand failure modes, such as continuing to search after they already have sufficient results or using overly verbose search queries.
Teach delegation explicitly. Each subagent needs an objective, output format, tool guidance, and clear task boundaries. Vague instructions like “research the semiconductor shortage” led to duplicated work and missed coverage.
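One way to make those four elements non-optional is to encode the task brief as a structure rather than free text. The field names below are our own illustration, not a real schema:

```python
from dataclasses import dataclass

# Illustrative task brief a lead agent might render into a subagent prompt.
@dataclass
class SubagentTask:
    objective: str        # the specific question to answer
    output_format: str    # how to report findings back
    tool_guidance: str    # which tools to prefer and why
    boundaries: str       # what NOT to cover, to avoid duplicated work

    def to_prompt(self) -> str:
        return (
            f"Objective: {self.objective}\n"
            f"Report format: {self.output_format}\n"
            f"Tools: {self.tool_guidance}\n"
            f"Do not cover: {self.boundaries}"
        )

task = SubagentTask(
    objective="Quantify the chip shortage's impact on automotive production",
    output_format="Bulleted findings with one source per claim",
    tool_guidance="Prefer web search; fall back to internal docs",
    boundaries="Consumer electronics supply chains (another subagent's scope)",
)
```

Forcing every delegation through a template like this makes vague briefs — a missing boundary, an unspecified format — immediately visible.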
Scale effort to query complexity. Embed scaling rules in prompts: simple fact-finding needs 1 agent with 3-10 tool calls, complex research might use 10+ subagents with divided responsibilities.
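These rules can live as an explicit effort table the prompt is generated from. The simple and complex tiers below use the figures quoted above; the middle tier is an illustrative interpolation, not a quoted number:

```python
# Effort-scaling table; "simple_fact" and "complex" mirror the rules above,
# "moderate" is an assumed middle tier for illustration.
def effort_budget(complexity: str) -> dict:
    table = {
        "simple_fact": {"subagents": 1,  "max_tool_calls": 10},
        "moderate":    {"subagents": 4,  "max_tool_calls": 15},  # assumption
        "complex":     {"subagents": 10, "max_tool_calls": 25},  # "10+ subagents"
    }
    return table[complexity]
```

Embedding the rendered table in the lead agent's prompt keeps it from spawning ten subagents to answer a one-line factual question.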
Design agent-tool interfaces carefully. Tool descriptions must be distinct and clear. Bad descriptions send agents down wrong paths. We created a tool-testing agent that rewrites descriptions after testing tools dozens of times, reducing task completion time by 40%.
Start wide, then narrow. Prompt agents to begin with short, broad queries, evaluate what’s available, then progressively focus. This mirrors expert human research patterns.
Use parallel tool calling. Sequential searches are painfully slow. We introduced two types of parallelization: lead agents spawn 3-5 subagents simultaneously, and subagents use 3+ tools in parallel. This cut research time by up to 90% for complex queries.
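Both layers of parallelism map naturally onto async fan-out. In this sketch `search_web` and `run_subagent` are stand-ins for real tool and agent invocations, with a sleep simulating network latency:

```python
import asyncio

async def search_web(query: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return f"results for {query!r}"

async def run_subagent(aspect: str) -> list:
    # Layer 2: each subagent fires 3+ tool calls in parallel.
    queries = [f"{aspect} overview", f"{aspect} statistics", f"{aspect} critiques"]
    return await asyncio.gather(*(search_web(q) for q in queries))

async def lead_agent(topic: str) -> list:
    # Layer 1: the lead agent spawns its subagents simultaneously.
    aspects = [f"{topic}: history", f"{topic}: current state", f"{topic}: outlook"]
    return await asyncio.gather(*(run_subagent(a) for a in aspects))

results = asyncio.run(lead_agent("fusion energy"))
```

With three subagents each making three calls, total wall-clock time is roughly one call's latency instead of nine — which is where the large speedups on complex queries come from.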
Evaluation Strategies
Multi-agent systems don’t follow identical paths between runs, making traditional evaluation methods insufficient.
Start small, iterate fast. Begin with 20 representative queries. Early changes have dramatic impacts—a prompt tweak might boost success from 30% to 80%. Small samples reveal these large effect sizes clearly.
Use LLM judges effectively. Research outputs resist programmatic evaluation. We used a single LLM judge evaluating factual accuracy, citation accuracy, completeness, source quality, and tool efficiency. This scaled to hundreds of outputs while aligning with human judgment.
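A single-judge rubric can be as simple as one prompt plus a strict parser. The five criteria are the ones listed above; the prompt wording, 0-1 scale, and JSON shape are our own illustration:

```python
import json

RUBRIC = ["factual_accuracy", "citation_accuracy", "completeness",
          "source_quality", "tool_efficiency"]

def judge_prompt(query: str, report: str) -> str:
    # One judge call scores every criterion at once.
    return (
        f"Grade this research report from 0 to 1 on each criterion: "
        f"{', '.join(RUBRIC)}.\n"
        f"Query: {query}\nReport: {report}\n"
        'Respond with JSON, e.g. {"factual_accuracy": 0.9, ...}.'
    )

def parse_scores(judge_output: str) -> dict:
    scores = json.loads(judge_output)
    # Fail loudly if the judge skipped a criterion.
    assert set(scores) == set(RUBRIC), "judge must score every criterion"
    return scores
```

A strict parser matters: silently accepting partial scorecards is how judge drift goes unnoticed across hundreds of outputs.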
Maintain human evaluation. People catch edge cases automation misses, like agents choosing SEO-optimized content farms over authoritative sources. Human testers revealed subtle biases that led to prompt improvements.
Production Engineering Challenges
Handle stateful errors gracefully. Agents run for extended periods, maintaining state across many tool calls. Minor failures can cascade into major behavioral changes. We built resume-from-checkpoint systems and let agents adapt intelligently when tools fail, rather than restarting long-running tasks from scratch.
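The checkpointing half of that is conceptually simple: persist agent state after each tool call so a crash resumes mid-task. A minimal sketch, with illustrative paths and state fields:

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    # Write to a temp file, then atomically replace, so a crash
    # mid-write can never leave a corrupt checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    # Fresh run: nothing completed yet.
    return {"completed_calls": [], "findings": []}
```

The hard part in practice is deciding what belongs in `state` — enough context for the agent to pick up where it left off, without replaying side-effecting tool calls.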
Debug with new approaches. Non-deterministic agent behavior makes traditional debugging insufficient. We added full production tracing to diagnose why agents failed, monitoring decision patterns and interaction structures without accessing conversation contents.
Deploy carefully. Agent systems are stateful webs of prompts, tools, and execution logic. We use rainbow deployments to gradually shift traffic from old to new versions while keeping both running simultaneously.
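The key property of a rainbow deployment for stateful agents is that in-flight sessions finish on the version that started them, while new sessions gradually shift. A toy router illustrating that invariant (not our deployment tooling):

```python
import random

class RainbowRouter:
    def __init__(self, new_fraction=0.0):
        self.new_fraction = new_fraction
        self.pinned = {}  # session_id -> version

    def route(self, session_id):
        # Pin each session once: a stateful agent run must never
        # switch versions mid-flight.
        if session_id not in self.pinned:
            version = "new" if random.random() < self.new_fraction else "old"
            self.pinned[session_id] = version
        return self.pinned[session_id]

    def shift(self, fraction):
        # Only affects sessions that start after the shift.
        self.new_fraction = fraction
```

Ramping `new_fraction` from 0 to 1 over hours or days drains the old version naturally instead of killing agents mid-run.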
Key Takeaways
Multi-agent systems transform how people solve complex problems, but the gap between prototype and production is wider than anticipated. Success requires:
- Careful prompt and tool design with tight feedback loops
- Comprehensive evaluation combining automated and human testing
- Robust operational practices for stateful, long-running processes
- Strong collaboration between research, product, and engineering teams
Users report that Claude’s Research feature helps them find business opportunities, navigate complex decisions, and save days of work by uncovering connections they wouldn’t have found alone.
The compound nature of errors in agentic systems means minor issues for traditional software can derail agents entirely. But with proper engineering, multi-agent research systems operate reliably at scale and deliver transformative value for open-ended research tasks.