V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

V-JEPA 2 is a self-supervised video model pre-trained on over 1 million hours of internet video to understand motion, predict future states, and support robotic planning. It achieves strong results on video understanding benchmarks, including state-of-the-art action anticipation, and, once adapted with a small amount of robot data, can drive object manipulation on robot arms without task-specific training.

The Challenge of Learning from Observation

Learning about the physical world mainly by observing it remains difficult for current AI systems. Most models need extensive labeled data or task-specific training to perform well. V-JEPA 2 addresses this limitation by combining massive-scale video pre-training with a small amount of robot interaction data.

How V-JEPA 2 Works

The model uses a joint-embedding predictive architecture (JEPA): rather than reconstructing pixels, it learns by predicting the representations of masked portions of a video from the visible portions. Pre-training requires no action labels, and the learned representations capture temporal dynamics and spatial relationships in the video.
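To make the idea of predicting in representation space concrete, here is a minimal sketch of a JEPA-style loss in plain Python. Everything in it is an illustrative assumption rather than the model's actual implementation: the random linear "encoders", the mean pooling, the scalar position hint, the L1 distance, and all names (`jepa_loss`, `context_w`, `target_w`, `predictor_w`) are hypothetical stand-ins. The one point it is meant to show is that the error is measured between embeddings of the hidden patches, not between pixels.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Random weight matrix (list of rows); a toy stand-in for a network."""
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

def apply_linear(weights, vec):
    """Multiply a weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def jepa_loss(patches, mask, context_w, target_w, predictor_w):
    """Mean L1 distance between predicted and target embeddings of masked patches.

    mask[i] is True when patch i is hidden from the context encoder.
    Assumes at least one visible and one hidden patch.
    """
    visible = [p for p, hidden in zip(patches, mask) if not hidden]
    hidden_idx = [i for i, hidden in enumerate(mask) if hidden]
    # Encode only the visible patches, then pool them into one context vector.
    context = mean_vector([apply_linear(context_w, p) for p in visible])
    total = 0.0
    for i in hidden_idx:
        # The predictor sees the context plus a position hint for the hidden patch.
        position = i / len(patches)
        predicted = apply_linear(predictor_w, context + [position])
        # Target embedding of the hidden patch; during training this branch
        # would be treated as a constant (no gradient flows through it).
        target = apply_linear(target_w, patches[i])
        total += sum(abs(a - b) for a, b in zip(predicted, target)) / len(target)
    return total / len(hidden_idx)

rng = random.Random(0)
patch_dim, embed_dim, num_patches = 8, 4, 6
patches = [[rng.uniform(-1, 1) for _ in range(patch_dim)] for _ in range(num_patches)]
mask = [False, True, False, True, True, False]
context_w = make_linear(patch_dim, embed_dim, rng)
target_w = make_linear(patch_dim, embed_dim, rng)
predictor_w = make_linear(embed_dim + 1, embed_dim, rng)

loss = jepa_loss(patches, mask, context_w, target_w, predictor_w)
print(f"toy JEPA loss: {loss:.4f}")
```

In a real system the encoders would be deep networks trained by gradient descent on this kind of loss; the sketch only evaluates the loss once for fixed random weights.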

Training Process

  1. Pre-training: The model trains on over 1 million hours of internet video and images without action labels
  2. Alignment: Researchers align the model with a large language model for video question-answering
  3. Robot adaptation: A small amount of unlabeled robot video (less than 62 hours) adapts the model for planning tasks

Performance Results

V-JEPA 2 demonstrates strong performance across multiple benchmarks:

  • Motion understanding: 77.3% top-1 accuracy on Something-Something v2
  • Action anticipation: 39.7% recall-at-5 on Epic-Kitchens-100 (state-of-the-art)
  • Video question-answering: 84.0% on PerceptionTest, 76.9% on TempCompass (at the 8 billion parameter scale)
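For readers unfamiliar with the metrics above, the following sketch shows how top-1 accuracy and recall-at-k can be computed from per-class scores. It is a simplified illustration, not the benchmarks' official evaluation code: the function names are hypothetical, the scores are made up, and averaging recall per class (rather than per sample) is an assumption about how anticipation results are commonly reported.

```python
from collections import defaultdict

def top_k_hit(scores, label, k):
    """True if `label` is among the k highest-scoring class indices."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return label in ranked[:k]

def top1_accuracy(all_scores, labels):
    """Fraction of samples whose highest-scoring class is the true label."""
    hits = [top_k_hit(s, y, 1) for s, y in zip(all_scores, labels)]
    return sum(hits) / len(hits)

def class_mean_recall_at_k(all_scores, labels, k=5):
    """Recall@k computed per class, then averaged over the classes present."""
    per_class = defaultdict(list)
    for s, y in zip(all_scores, labels):
        per_class[y].append(top_k_hit(s, y, k))
    return sum(sum(h) / len(h) for h in per_class.values()) / len(per_class)

# Made-up scores for 4 samples over 6 classes.
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],  # true class 1
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],  # true class 2
    [0.40, 0.30, 0.10, 0.10, 0.05, 0.05],  # true class 5
    [0.20, 0.10, 0.10, 0.10, 0.10, 0.40],  # true class 5
]
labels = [1, 2, 5, 5]

print(top1_accuracy(scores, labels))                # 0.5
print(class_mean_recall_at_k(scores, labels, k=2))  # (1.0 + 1.0 + 0.5) / 3
```

Recall-at-5 is more forgiving than top-1 accuracy because a prediction counts as correct whenever the true action appears anywhere in the top five, which suits anticipation tasks where several futures are plausible.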

Robotic Applications

The researchers created V-JEPA 2-AC, an action-conditioned version for robotic planning. This model enables zero-shot deployment on Franka robotic arms for pick-and-place tasks using only image goals.
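Planning toward an image goal with an action-conditioned predictor can be sketched as a search over action sequences: roll each candidate sequence forward through the predictor and prefer sequences whose predicted final state lands near the goal. The article does not describe the planner, so the example below is an assumption throughout. It uses the cross-entropy method (a common sampling-based optimizer) with an L1 distance, and `toy_predictor` is a hypothetical stand-in in which the "state" is a 2-D vector and an action simply shifts it. In a real system the state and goal would be embeddings of camera images.

```python
import random

def l1(a, b):
    """L1 distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def toy_predictor(state, action):
    """Hypothetical action-conditioned predictor: next state = state + action."""
    return [s + a for s, a in zip(state, action)]

def rollout(state, actions, predictor):
    """Apply the predictor along an action sequence and return the final state."""
    for action in actions:
        state = predictor(state, action)
    return state

def plan_cem(start, goal, predictor, horizon=3, samples=200, elites=20,
             iters=10, seed=0):
    """Cross-entropy method: sample sequences, keep the best, refit, repeat."""
    rng = random.Random(seed)
    dim = len(start)
    mean = [[0.0] * dim for _ in range(horizon)]
    std = [[1.0] * dim for _ in range(horizon)]
    for _ in range(iters):
        candidates = []
        for _ in range(samples):
            seq = [[rng.gauss(mean[t][d], std[t][d]) for d in range(dim)]
                   for t in range(horizon)]
            # Cost: distance between the predicted final state and the goal.
            cost = l1(rollout(start, seq, predictor), goal)
            candidates.append((cost, seq))
        candidates.sort(key=lambda c: c[0])
        best = [seq for _, seq in candidates[:elites]]
        # Refit the sampling distribution to the elite sequences.
        for t in range(horizon):
            for d in range(dim):
                vals = [seq[t][d] for seq in best]
                m = sum(vals) / elites
                mean[t][d] = m
                std[t][d] = (sum((v - m) ** 2 for v in vals) / elites) ** 0.5 + 1e-6
    return mean

start, goal = [0.0, 0.0], [1.0, -2.0]
plan = plan_cem(start, goal, toy_predictor)
final = rollout(start, plan, toy_predictor)
print("distance before:", l1(start, goal))
print("distance after: ", round(l1(final, goal), 3))
```

A typical control loop built on such a planner would execute only the first planned action, observe the new camera image, and replan. Note that no reward function appears anywhere: the goal image alone defines the cost.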

Key Advantages

  • No environment-specific data collection required
  • No task-specific training needed
  • No reward engineering necessary
  • Works across different laboratory environments

Implementation Considerations

When deploying V-JEPA 2 for robotics applications, consider these factors:

  • Adaptation needs only a small amount of robot trajectory data, and none from the deployment environment
  • Zero-shot deployment has been shown for pick-and-place manipulation; other tasks are not covered by these results
  • Performance may vary with object types and environmental conditions
  • Integration with existing robotic systems requires careful planning

Next Steps

V-JEPA 2 demonstrates how self-supervised learning from web-scale video data can produce versatile world models. Future research should explore a wider range of robotic tasks, better sample efficiency, and more complex environments.

For robotics practitioners, this work suggests that large-scale video pre-training combined with minimal robot data can enable flexible manipulation capabilities without extensive task-specific engineering.