GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Researchers from UC Berkeley, Stanford, and other institutions have introduced GEPA (Genetic-Pareto), a prompt optimizer that challenges the dominance of reinforcement learning in AI system optimization. GEPA achieves superior performance while requiring dramatically less training, using up to 35 times fewer rollouts than traditional RL methods.
The Sample Efficiency Problem
Current reinforcement learning approaches such as Group Relative Policy Optimization (GRPO) require tens of thousands of rollouts to adapt large language models to new tasks; recent studies report that GRPO implementations typically use 100,000 to 500,000 rollouts for training across various applications. This sample inefficiency creates serious bottlenecks for AI systems that make expensive tool calls, operate under tight inference budgets, or rely on the largest models, which often cannot be fine-tuned at all.
The core insight driving GEPA is that rollouts from sophisticated LLM systems contain rich natural language traces—instructions, reasoning chains, tool calls, and evaluation feedback. These traces can be understood and analyzed by modern LLMs, potentially enabling more effective learning than standard RL approaches that collapse this information into scalar rewards.
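To make that contrast concrete, the sketch below shows the kind of rollout record a reflective optimizer could read; the class and field names here are illustrative assumptions, not the paper's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class RolloutTrace:
    """Illustrative record of one system rollout (field names are hypothetical)."""
    instruction: str          # the prompt/instruction a module used
    reasoning: str            # chain-of-thought text produced by the LLM
    tool_calls: list[str]     # e.g. retrieval queries issued during the rollout
    evaluator_feedback: str   # natural-language feedback from the metric or judge
    score: float              # the scalar reward is all that standard RL keeps

trace = RolloutTrace(
    instruction="Answer the question using the retrieved passages.",
    reasoning="The first hop found the author; a second hop is needed for the birthplace.",
    tool_calls=["search('author of ...')", "search('birthplace of ...')"],
    evaluator_feedback="Answer missed the second hop; the birthplace was never queried.",
    score=0.0,
)

# A scalar-reward optimizer like GRPO learns only from `trace.score`;
# a reflective optimizer can also read the text fields above.
```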
How GEPA Works
GEPA combines three core principles:
Genetic Prompt Evolution: The system maintains a candidate pool of prompt configurations and iteratively proposes new candidates through mutation or crossover operations. Each candidate inherits learning signals from its parents while accumulating new insights from current rollouts.
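A minimal sketch of that outer loop, under assumed interfaces, might look like the following; `run`, `mutate`, and `select` are hypothetical callables (the latter two are sketched in the next two sections), and crossover between parents is omitted for brevity.

```python
def optimize(seed_prompt, train_batch, budget, run, mutate, select):
    """Hypothetical skeleton of a GEPA-style candidate-pool loop.

    `run(prompt, example)` executes the system and returns (score, trace);
    `mutate` and `select` stand in for reflective mutation and Pareto-based
    selection, sketched below.
    """
    pool = [seed_prompt]
    results = {seed_prompt: [run(seed_prompt, x) for x in train_batch]}

    for _ in range(budget):
        parent = select(pool, results)                 # pick a promising parent
        traces = [trace for _, trace in results[parent]]
        child = mutate(parent, traces)                 # child inherits the parent's lessons
        results[child] = [run(child, x) for x in train_batch]
        pool.append(child)

    # Return the candidate with the best total score on the training batch.
    return max(pool, key=lambda p: sum(score for score, _ in results[p]))
```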
Reflective Prompt Mutation: When updating prompts, GEPA uses natural language feedback from system execution traces. An LLM examines these traces to perform implicit credit assignment, attributing successes or failures to specific modules and proposing targeted improvements.
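One way to picture the reflective step is a single LLM call that reads the traces and rewrites the instruction. The prompt wording, the `llm` callable, and the reuse of the `RolloutTrace` fields from the earlier sketch are assumptions, not the paper's exact implementation.

```python
def mutate(parent_prompt: str, traces: list, llm) -> str:
    """Hypothetical reflective mutation: diagnose failures from execution traces
    and evaluator feedback, then propose an improved instruction.

    `llm` is any callable that maps a prompt string to a completion string.
    """
    trace_report = "\n\n".join(
        f"Reasoning: {t.reasoning}\n"
        f"Tool calls: {t.tool_calls}\n"
        f"Feedback: {t.evaluator_feedback}\n"
        f"Score: {t.score}"
        for t in traces
    )
    reflection_prompt = (
        "You are improving the instruction for one module of a compound AI system.\n\n"
        f"Current instruction:\n{parent_prompt}\n\n"
        f"Execution traces and evaluator feedback:\n{trace_report}\n\n"
        "Explain which behaviors caused failures (credit assignment), then write an "
        "improved instruction that addresses them. Return only the new instruction."
    )
    return llm(reflection_prompt)
```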
Pareto-Based Candidate Selection: Instead of always selecting the best-performing candidate (which can lead to local optima), GEPA maintains a Pareto frontier. It identifies candidates that achieve the best score on at least one training instance, then stochastically samples from this diverse set of “winning” strategies.
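The selection step can be sketched as follows, reusing the `pool` and `results` structures from the loop above; uniform sampling over the frontier is a simplification of whatever weighting the full method applies.

```python
import random

def select(pool, results):
    """Hypothetical Pareto-based selection: keep every candidate that achieves the
    best score on at least one training instance, then sample one of them.

    `results[p]` is a list of (score, trace) pairs for candidate `p`, one entry
    per training instance, as produced by the loop sketched earlier.
    """
    n_instances = len(results[pool[0]])
    frontier = set()
    for i in range(n_instances):
        best = max(results[p][i][0] for p in pool)
        frontier.update(p for p in pool if results[p][i][0] == best)

    # Sampling from the frontier, rather than greedily taking the single best
    # average scorer, keeps diverse "winning" strategies in play.
    return random.choice(sorted(frontier, key=pool.index))
```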
Evaluation Results
GEPA was tested on four diverse tasks: multi-hop reasoning (HotpotQA), instruction following (IFBench), privacy-aware delegation (PUPA), and retrieval-augmented verification (HoVer). The results demonstrate consistent superiority:
- Against GRPO: GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35× fewer rollouts
- Against MIPROv2: GEPA surpasses the leading prompt optimizer by over 10% across two LLMs
- Sample Efficiency: GEPA matches GRPO’s best validation scores with only 102-1,179 training rollouts compared to GRPO’s 24,000
Key Advantages
Computational Efficiency: GEPA-generated prompts are up to 9.2× shorter than those from MIPROv2, reducing both runtime costs and latency for downstream tasks.
Rapid Adaptation: Even a single reflective prompt update can yield large improvements, as demonstrated in the optimization trajectories.
Generalization: Instruction-optimized prompts show better generalization compared to few-shot demonstration approaches, with lower generalization gaps between validation and test performance.
Practical Applications
Beyond task adaptation, GEPA shows promise as an inference-time search strategy. Preliminary experiments demonstrate its effectiveness for code optimization:
- NPU Kernels: GEPA raised the mean vector utilization of generated kernels from 4.25% to 30.52%
- CUDA Kernels: GEPA produced kernels that execute faster than PyTorch eager mode on over 20% of representative tasks
Implementation Insights
GEPA’s success stems from leveraging the interpretable nature of language. The system can extract rich learning signals from execution traces that would be lost when collapsed into scalar rewards. The Pareto-based selection strategy prevents the optimizer from getting stuck in local optima while maintaining exploration of diverse strategies.
The approach is particularly valuable in resource-constrained environments where rollouts are expensive or when working with systems that cannot be fine-tuned at the weight level.
Next Steps
GEPA represents a shift toward language-driven optimization that capitalizes on LLMs’ improved instruction-following and self-reflective capabilities. For developers working with compound AI systems, GEPA offers a practical path to optimization that requires minimal computational resources while achieving superior performance.
The method’s ability to turn even a few rollouts into significant quality gains makes it especially relevant for real-world applications where extensive training is impractical or costly.