Ferret-UI Lite: Building Efficient 3B On-Device GUI Agents with Reinforcement Learning

Apple researchers have developed Ferret-UI Lite, a compact 3B multimodal language model that automates GUI interactions across mobile, web, and desktop platforms. This on-device agent addresses the growing need for low-latency, privacy-preserving AI assistants that work without constant internet connectivity.

The Challenge of Small GUI Agents

Most existing GUI agents rely on large server-side models like GPT-4 or Gemini, which offer impressive reasoning capabilities but require significant computational resources and network connectivity. For scenarios like voice-activated reminders while driving or recipe assistance with wet hands in the kitchen, users need lightweight models that run directly on their devices.

Building effective small GUI agents presents unique challenges. These models must handle diverse tasks including GUI element grounding (finding specific buttons or text fields), screen understanding, multi-step planning, and self-reflection—all within a compact 3B parameter footprint.

Key Technical Innovations

Curated Data Mixture

Ferret-UI Lite combines real and synthetic training data from multiple sources:

  • Real data: Public datasets spanning mobile, desktop, and web platforms including GroundUI, OS-Atlas, UGround, and Aria-UI
  • High-resolution grounding data: Composite images created by concatenating multiple GUI screenshots to expose models to denser layouts
  • Chain-of-thought navigation data: Synthetic reasoning traces including planning, action thinking, and reflection components
  • Online synthetic data: Multi-agent system rollouts that introduce action errors and replanning strategies

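The high-resolution grounding idea above can be sketched in a few lines: tile several screenshots into one composite and remap each element's bounding box into the composite's coordinate frame. The row-major grid layout and normalized-coordinate convention here are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of composite grounding-data construction: screenshots are tiled
# into an n_rows x n_cols grid, and each labeled element's normalized
# (x0, y0, x1, y1) box is remapped into composite coordinates.

def remap_box(box, tile_row, tile_col, n_rows, n_cols):
    """Map a normalized box from one screenshot into the coordinate
    frame of the tiled composite image."""
    x0, y0, x1, y1 = box
    return (
        (tile_col + x0) / n_cols,
        (tile_row + y0) / n_rows,
        (tile_col + x1) / n_cols,
        (tile_row + y1) / n_rows,
    )

def build_composite_labels(screens, n_rows, n_cols):
    """screens: per-screenshot label lists [(name, box), ...], ordered
    row-major to match the tiled composite image."""
    labels = []
    for idx, elems in enumerate(screens):
        r, c = divmod(idx, n_cols)
        for name, box in elems:
            labels.append((name, remap_box(box, r, c, n_rows, n_cols)))
    return labels
```

Because every box shrinks relative to the composite, the model is forced to localize targets in a denser, higher-resolution layout than any single screenshot provides.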
Visual Tool-Use with Zoom-In

The model employs a two-stage visual processing approach:

  1. Generate initial prediction on full screenshot
  2. Crop image around predicted location and refine prediction on zoomed region

This mirrors how a person leans in to inspect a small control, and it lets the small model attend to the relevant screen region without processing excessive visual tokens.

Reinforcement Learning with Verifiable Rewards

Ferret-UI Lite uses Group Relative Policy Optimization (GRPO) with carefully designed reward functions:

For grounding tasks: Containment-based rewards that accept any prediction within the target bounding box, rather than requiring exact center point matches.
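The containment-based reward, and the group-relative normalization that gives GRPO its name, can be sketched as follows. The group size and the epsilon in the normalizer are illustrative choices, not values from the paper.

```python
# Sketch of a containment-based grounding reward: any predicted point
# inside the target box scores 1, with no requirement to hit the center.

def containment_reward(point, target_box):
    """1.0 if the predicted normalized (x, y) point lies inside the
    target (x0, y0, x1, y1) box, else 0.0."""
    x, y = point
    x0, y0, x1, y1 = target_box
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: each sampled rollout's reward is
    normalized against the mean and std of its own sample group,
    avoiding the need for a learned value function."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because containment is binary and verifiable from the ground-truth annotation alone, the reward needs no learned judge model, which keeps RL training cheap for a 3B policy.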

For navigation tasks: Combined rewards evaluating both action type correctness and parameter precision, with options for sparse or dense grounding rewards.

Performance Results

GUI Grounding Excellence

Ferret-UI Lite achieves strong grounding performance across benchmarks:

  • ScreenSpot-V2: 91.6% accuracy
  • ScreenSpot-Pro: 53.3% accuracy
  • OSWorld-G: 55.3% accuracy

These results outperform other 3B models by significant margins and approach the performance of much larger 7B models.

Multi-Step Navigation Challenges

End-to-end navigation success rates are lower:

  • AndroidWorld: 28.0% success rate
  • OSWorld: 19.8% success rate

While competitive with similar-sized models, these scores highlight the inherent difficulty of complex reasoning in lightweight agents.

Key Findings and Lessons

Data Balance Matters

Experiments reveal that grounding and navigation data are mutually beneficial, with a balanced 50:50 mixture ratio achieving optimal results across both task types.

Synthetic Data Scaling

Progressive scaling of synthetic data from 5K to 17K trajectories yielded nearly 6% improvement in navigation performance, demonstrating the value of diverse synthetic training examples.

Reward Design Sensitivity

Small models prove sensitive to reinforcement learning reward structures. Combining action type rewards with grounding rewards consistently outperforms single-reward approaches, with dense rewards outperforming sparse alternatives.

Inference-Time Techniques Help

Chain-of-thought reasoning and visual zoom-in provide measurable improvements, though benefits remain limited compared to larger models.

Practical Implications

Ferret-UI Lite demonstrates that compact on-device GUI agents can achieve competitive grounding performance while maintaining the advantages of local processing: low latency, strong privacy guarantees, and offline functionality.

However, the research also reveals fundamental limitations. Multi-step navigation requiring complex reasoning remains challenging for 3B models, suggesting that certain agentic capabilities may require larger parameter counts or novel architectural approaches.

Next Steps

The findings provide valuable guidance for developers building on-device AI agents:

  1. Prioritize data diversity: Combine real and synthetic data sources with balanced task representation
  2. Design verifiable rewards carefully: Use combined reward functions that evaluate both action correctness and parameter precision
  3. Leverage inference-time techniques: Implement visual tool-use and chain-of-thought reasoning for incremental gains
  4. Set realistic expectations: Focus on single-step grounding tasks where small models excel, while considering hybrid approaches for complex multi-step scenarios

Ferret-UI Lite represents significant progress toward practical on-device GUI automation, even as it illuminates the ongoing challenges in scaling down complex agentic capabilities.