Mimosa Framework: Self-Evolving Multi-Agent Systems for Scientific Research

Scientists face a growing bottleneck: data generation outpaces analysis capabilities, while rigid AI systems can’t adapt to evolving research needs. Mimosa addresses this challenge by automatically creating and refining multi-agent workflows that learn from experience.

The Problem with Current Scientific AI

Existing autonomous research systems suffer from two critical limitations. Single-agent systems collapse under long reasoning chains, losing context and making repetitive errors. Multi-agent systems improve task decomposition but rely on fixed coordination protocols that can’t adapt when experiments behave unexpectedly or new tools become available.

Consider computational drug design. A typical study moves through virtual screening, molecular docking, and dynamics simulations. Each stage requires different expertise and tools. Results from later stages often challenge earlier assumptions, requiring iterative revision. Current AI systems lack mechanisms to maintain specialized roles across stages while critically reassessing decisions as findings emerge.

How Mimosa Works

Mimosa operates through five coordinated layers that transform static pipelines into adaptive, learning systems:

Dynamic Tool Discovery: The system scans for available computational tools using the Model Context Protocol (MCP), automatically integrating new capabilities without manual configuration.
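The article doesn't show Mimosa's discovery code, but the idea can be sketched in a few lines. Everything here is an illustrative stand-in: `Tool`, `ToolRegistry`, and `discover` are hypothetical names, and the listing format loosely mirrors the tool descriptions an MCP server advertises rather than Mimosa's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Tool:
    # Hypothetical record for a discovered capability; field names are
    # illustrative, not Mimosa's actual schema.
    name: str
    description: str
    schema: dict

@dataclass
class ToolRegistry:
    tools: dict = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        # Registration is idempotent, so re-scanning a server is safe
        # and newly appearing tools are picked up without reconfiguration.
        self.tools[tool.name] = tool

    def discover(self, server_listing: list) -> list:
        # In a real MCP client this listing would come from a server's
        # tool-listing response; here it is passed in directly.
        found = []
        for entry in server_listing:
            tool = Tool(entry["name"], entry.get("description", ""),
                        entry.get("inputSchema", {}))
            self.register(tool)
            found.append(tool)
        return found

registry = ToolRegistry()
listing = [{"name": "dock_ligand", "description": "Run molecular docking",
            "inputSchema": {"type": "object"}}]
new_tools = registry.discover(listing)
```

The point of the pattern is that agents query the registry at run time instead of being compiled against a fixed tool list, which is what makes "without manual configuration" possible.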

Workflow Synthesis: A meta-orchestrator generates task-specific multi-agent workflows on demand. Rather than using predefined templates, it creates directed graphs where specialized agents handle distinct subtasks with bounded context windows.
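A directed graph of specialized agents can be made concrete with a minimal sketch. The `AgentNode` and `Workflow` classes below are assumptions for illustration, not Mimosa's API; the execution-order step uses Kahn's algorithm so each agent runs only after its upstream dependencies.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AgentNode:
    # Illustrative: each agent handles one subtask with a bounded slice of
    # context rather than the full task history.
    name: str
    subtask: str

@dataclass
class Workflow:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (upstream, downstream) pairs

    def add(self, node: AgentNode, after=()):
        self.nodes[node.name] = node
        for dep in after:
            self.edges.append((dep, node.name))

    def execution_order(self) -> list:
        # Kahn's algorithm: schedule an agent only once all of its
        # upstream agents have been scheduled.
        indeg = {n: 0 for n in self.nodes}
        for _, dst in self.edges:
            indeg[dst] += 1
        queue = deque(n for n, d in indeg.items() if d == 0)
        order = []
        while queue:
            current = queue.popleft()
            order.append(current)
            for src, dst in self.edges:
                if src == current:
                    indeg[dst] -= 1
                    if indeg[dst] == 0:
                        queue.append(dst)
        return order

# The drug-design pipeline from earlier, expressed as a workflow graph.
wf = Workflow()
wf.add(AgentNode("screener", "virtual screening"))
wf.add(AgentNode("docker", "molecular docking"), after=["screener"])
wf.add(AgentNode("simulator", "dynamics simulation"), after=["docker"])
```

Because the graph is data rather than code, the meta-orchestrator can synthesize a different one per task instead of instantiating a fixed template.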

Iterative Refinement: The system executes workflows, evaluates performance using an LLM-based judge, and proposes targeted improvements. This creates a feedback loop where workflows evolve based on empirical results.
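The execute-judge-propose loop reduces to a short greedy search, sketched below with stand-in callables (`execute`, `judge`, `propose` are assumptions; the real judge is an LLM and the real proposals are workflow edits). A candidate replaces the incumbent only when the judge scores it higher.

```python
def refine(workflow, execute, judge, propose, iterations=8):
    # Greedy improvement loop (illustrative): keep a candidate only if the
    # judge scores its execution results above the incumbent's.
    best_score = judge(execute(workflow))
    for _ in range(iterations):
        candidate = propose(workflow, best_score)
        score = judge(execute(candidate))
        if score > best_score:
            workflow, best_score = candidate, score
    return workflow, best_score

# Toy stand-ins: a "workflow" is an int and proposals nudge it toward an
# optimum at 10, so the loop should converge there.
result = refine(
    workflow=3,
    execute=lambda w: w,
    judge=lambda out: -abs(out - 10),
    propose=lambda w, s: w + 1,
)
# result == (10, 0): the incumbent climbs to the optimum, then rejects
# the overshooting proposal.
```

Note this is single-incumbent hill climbing, which matches the plateau behavior described in the results below.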

Code-Based Execution: Agents generate and execute Python code directly, enabling complex tool orchestration and data analysis within scientific computing environments.
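A bare-bones version of code-based execution is just running generated source in a fresh namespace and capturing its output. The function name and return shape below are hypothetical; a production system would add real sandboxing (process isolation, resource limits) rather than calling `exec()` in-process as this sketch does.

```python
import contextlib
import io

def run_generated_code(code: str, inputs: dict) -> dict:
    # Minimal sketch: execute agent-generated Python against supplied
    # inputs and capture stdout plus the resulting namespace.
    # WARNING: plain exec() offers no isolation; it is for illustration only.
    namespace = dict(inputs)
    stdout = io.StringIO()
    with contextlib.redirect_stdout(stdout):
        exec(code, namespace)
    return {"stdout": stdout.getvalue(), "namespace": namespace}

result = run_generated_code(
    "mean = sum(values) / len(values)\nprint(f'mean={mean}')",
    {"values": [2.0, 4.0, 6.0]},
)
# result["stdout"] == "mean=4.0\n"
```

Returning the namespace is what lets downstream agents consume upstream results (here, `mean`) instead of re-parsing printed text.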

Reproducible Archiving: All execution traces are logged and workflows are archived, supporting auditability and reuse across similar tasks.
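An append-only trace log can be sketched as follows. The article only states that traces are logged and archived; the chained hash here is my addition, one simple way to make such a trace tamper-evident, and `archive_step` is a hypothetical name.

```python
import hashlib
import json

def archive_step(log: list, agent: str, action: str, payload: dict) -> str:
    # Append-only trace: each entry's hash covers the previous entry's
    # hash, so any later edit to the log is detectable on replay.
    prev = log[-1]["hash"] if log else ""
    entry = {"agent": agent, "action": action, "payload": payload, "prev": prev}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    log.append(entry)
    return entry["hash"]

trace = []
archive_step(trace, "screener", "filter_compounds", {"kept": 120})
archive_step(trace, "docker", "dock", {"best_affinity": -9.2})
```

Serialized as JSON lines, such a trace doubles as the audit record and as the archive a later, similar task could retrieve a workflow from.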

Performance Results

Testing on ScienceAgentBench—102 computational tasks across bioinformatics, chemistry, and other scientific domains—reveals significant improvements through workflow evolution.

DeepSeek-V3.2 achieved the strongest results: a 43.1% success rate with iterative learning versus 32.4% with static multi-agent coordination. This represents a 33% relative improvement while keeping costs at roughly $1.70 per task.
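The 33% relative-improvement figure follows directly from the two success rates:

```python
static, adaptive = 0.324, 0.431
relative_gain = (adaptive - static) / static
# (0.431 - 0.324) / 0.324 ≈ 0.330, i.e. roughly a 33% relative improvement
```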

The results reveal model-specific responses to multi-agent decomposition. GPT-4o improved nearly fivefold from single-agent to multi-agent configurations (3.8% to 18.6% success rate). DeepSeek-V3.2 showed a different pattern: strong single-agent performance (38.2%) initially degraded under static coordination before surpassing baselines through iterative learning.

Workflow evolution shows consistent gains across iterations 1-8, with diminishing returns appearing around iteration 10. This suggests the current greedy search strategy reaches performance ceilings that could be addressed through population-based exploration.
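The contrast between the current strategy and the proposed alternative can be sketched abstractly. This is not Mimosa's code: it is a generic truncation-selection loop showing how keeping several workflow variants alive differs from the single-incumbent greedy search, which can stall on the first plateau it reaches.

```python
import random

def population_search(seed, mutate, score, pop_size=4, generations=5, rng=None):
    # Illustrative population-based alternative to greedy search: the top
    # half of each generation survives and also spawns mutated children,
    # so the search carries multiple lineages instead of one incumbent.
    rng = rng or random.Random(0)
    population = [seed] + [mutate(seed, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[: pop_size // 2]          # truncation selection
        children = [mutate(p, rng) for p in parents]
        population = parents + children
    return max(population, key=score)

# Toy fitness landscape: "workflows" are ints with an optimum at 7.
best = population_search(
    seed=0,
    mutate=lambda w, rng: w + rng.choice([-1, 1]),
    score=lambda w: -abs(w - 7),
)
```

Because the best parents always survive a generation, the top score is monotonically non-decreasing, the same guarantee greedy search gives, but with more paths out of a local ceiling.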

Key Technical Innovations

Semantic Drift Mitigation: Multi-agent decomposition limits the context accumulation that causes single agents to lose focus over long reasoning chains.

Adaptive Architecture: Unlike fixed coordination protocols, Mimosa’s meta-orchestrator can restructure agent roles, communication patterns, and tool allocations based on task feedback.

Tool-Agnostic Design: MCP integration enables workflows to adapt to changing computational environments without system modifications.

Empirical Optimization: LLM-based evaluation provides directional feedback sufficient to guide workflow improvements toward better task outcomes.

Implications for Scientific Computing

Mimosa demonstrates that multi-agent architectures can achieve competitive performance at substantially lower computational cost than frontier reasoning models. The system reached its 43.1% success rate using DeepSeek-V3.2 at approximately 27× lower input cost than comparable o1-preview results.

This suggests a pathway toward more sustainable scientific AI: decomposing complex tasks across specialized, efficient models while leveraging established domain tools refined over decades of scientific computing.

The framework’s open-source design and reproducible execution traces address reproducibility challenges in computational research. Every analytical step is recorded and auditable, supporting verification of published results.

Current Limitations and Future Directions

The LLM-based judge provides directional but coarse-grained feedback. While sufficient for workflow improvement, correlation with ground-truth metrics requires further validation.

Single-incumbent search reaches performance plateaus around iteration 10. Population-based evolutionary strategies could extend learning beyond current ceilings.

Archive retrieval mechanisms are implemented but were not exercised, because the evaluation tasks were too diverse for workflow reuse to occur. Systematic testing on related task families would be needed to validate workflow transfer.

The system currently handles environment setup autonomously, which differs from pre-configured benchmark baselines. This design choice supports end-to-end autonomy but complicates direct performance comparisons.

Getting Started

Mimosa and its companion tool management platform Toolomics are available as open-source software under the Apache License 2.0.

The modular architecture supports integration with existing scientific computing environments while enabling extension to new domains and tools.

Conclusion

Mimosa represents a step toward truly adaptive scientific AI systems. By treating workflow design as a learnable component rather than a fixed constraint, the framework enables autonomous systems that improve through experience while maintaining the transparency and reproducibility that scientific research demands.

The heterogeneous responses observed across models suggest that optimal scientific AI architectures must be designed with specific model capabilities in mind. As the field advances, understanding these architecture-model interactions will be crucial for deploying effective autonomous research systems.