V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Researchers have developed V-JEPA 2, a self-supervised video model that learns from over 1 million hours of internet video to understand motion, predict future states, and enable robotic planning. The model achieves state-of-the-art performance on video understanding tasks and can control robots for object manipulation without task-specific training.

The Challenge of Learning from Observation

Learning about the physical world mainly by observing it remains difficult for current AI systems. Most models need extensive labeled data or task-specific training to perform well. V-JEPA 2 addresses this limitation by combining massive-scale video pre-training with a small amount of robot interaction data.

How V-JEPA 2 Works

The model uses a joint-embedding-predictive architecture (JEPA) that learns representations without requiring action labels during pre-training. This approach enables the model to understand temporal dynamics and spatial relationships in video data.
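As a rough illustration of what "predicting in embedding space" can mean, the sketch below compares a predicted embedding against a target embedding with an L1 loss, and updates a slowly moving target encoder as an exponential moving average of the online encoder. This is a minimal toy under stated assumptions, not the authors' implementation: the function names, the momentum value, and the example numbers are all hypothetical, and real encoders are neural networks rather than short lists of floats.

```python
from typing import List

Vector = List[float]


def l1_loss(predicted: Vector, target: Vector) -> float:
    """Mean absolute error between a predicted and a target embedding."""
    if len(predicted) != len(target):
        raise ValueError("embedding sizes must match")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def ema_update(target_weights: Vector, online_weights: Vector,
               momentum: float = 0.99) -> Vector:
    """Move target-encoder weights a small step toward the online encoder.

    The momentum value is an assumption chosen for illustration.
    """
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_weights, online_weights)]


# Toy embeddings for one masked video patch (hypothetical numbers).
predicted_embedding = [0.5, -1.0, 2.0]
target_embedding = [1.0, -1.0, 1.0]

# Per-dimension errors are 0.5, 0.0 and 1.0, so the mean is 0.5.
loss = l1_loss(predicted_embedding, target_embedding)
```

The point of the design is that the error is measured between embeddings rather than between pixels, so the model is not penalized for failing to reproduce unpredictable visual detail.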

Training Process

Pre-training: The model trains on over 1 million hours of internet video and images without action labels
Alignment: Researchers align the model with a large language model for video question-answering
Robot adaptation: A small amount of robot trajectory data (less than 62 hours) adapts the model for planning tasks

Performance Results

V-JEPA 2 demonstrates strong performance across multiple benchmarks:

Motion understanding: 77.3% top-1 accuracy on Something-Something v2
Action anticipation: 39.7% recall-at-5 on Epic-Kitchens-100 (state-of-the-art)
Video question-answering: 84.0% on PerceptionTest, 76.9% on TempCompass (at the 8-billion-parameter scale)
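For readers unfamiliar with the metrics above, the following sketch shows how top-1 accuracy and a simple recall-at-5 style score can be computed from per-class scores. It is a simplified illustration: the helper names and the toy scores are hypothetical, and official benchmark protocols may aggregate differently (for example, by averaging per class rather than per sample).

```python
from typing import Sequence


def top_k_hit(scores: Sequence[float], label: int, k: int) -> bool:
    """Return True if the true label is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return label in ranked[:k]


def top_k_rate(all_scores: Sequence[Sequence[float]],
               labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label appears in the top k predictions."""
    if len(all_scores) != len(labels) or not labels:
        raise ValueError("need one non-empty score list per label")
    hits = sum(top_k_hit(s, y, k) for s, y in zip(all_scores, labels))
    return hits / len(labels)


# Three toy samples over three classes (hypothetical scores).
toy_scores = [
    [0.1, 0.7, 0.2],  # highest score: class 1
    [0.6, 0.3, 0.1],  # highest score: class 0
    [0.2, 0.3, 0.5],  # highest score: class 2
]
toy_labels = [1, 1, 2]

top1 = top_k_rate(toy_scores, toy_labels, k=1)  # 2 of 3 samples correct
top2 = top_k_rate(toy_scores, toy_labels, k=2)  # all 3 labels are in the top 2
```

With k set to 5 the same function gives a per-sample recall-at-5, which is the more forgiving measure typically used for anticipation tasks where several next actions are plausible.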

Robotic Applications

The researchers created V-JEPA 2-AC, an action-conditioned version of the model for robotic planning. It is deployed zero-shot on Franka robotic arms in two different labs, where it picks and places objects by planning toward image goals.
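Planning toward an image goal can be pictured as searching for the action sequence whose predicted outcome lands closest to the goal in embedding space. The sketch below shows that idea with plain random shooting over a toy predictor. Everything here is an assumption made for illustration: `toy_predictor` is a hypothetical stand-in for an action-conditioned world model, the latents are hand-written vectors rather than encoded images, and random shooting is swapped in as the simplest possible optimizer rather than taken from the paper.

```python
import random
from typing import Callable, List, Tuple

Latent = List[float]
Action = List[float]


def l1_distance(a: Latent, b: Latent) -> float:
    """Sum of absolute differences between two latent vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def toy_predictor(latent: Latent, action: Action) -> Latent:
    """Hypothetical world model: the action simply shifts the latent."""
    return [z + u for z, u in zip(latent, action)]


def plan_by_random_shooting(
    start: Latent,
    goal: Latent,
    predictor: Callable[[Latent, Action], Latent],
    horizon: int = 3,
    num_candidates: int = 256,
    seed: int = 0,
) -> Tuple[List[Action], float]:
    """Sample action sequences, roll each out, and keep the closest to the goal."""
    rng = random.Random(seed)
    best_plan: List[Action] = []
    best_cost = float("inf")
    for _ in range(num_candidates):
        # In this toy, an action has the same dimension as the latent.
        plan = [[rng.uniform(-1.0, 1.0) for _ in start] for _ in range(horizon)]
        latent = list(start)
        for action in plan:
            latent = predictor(latent, action)
        cost = l1_distance(latent, goal)
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost


start_latent = [0.0, 0.0]   # stand-in for the encoded current image
goal_latent = [1.5, -0.5]   # stand-in for the encoded goal image
plan, cost = plan_by_random_shooting(start_latent, goal_latent, toy_predictor)
```

In a receding-horizon setup, a controller would execute only the first action of `plan`, observe the new image, and plan again. No reward function appears anywhere in the loop, because the goal image itself defines the cost.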

Key Advantages

No environment-specific data collection required
No task-specific training needed
No reward engineering necessary
Works across different laboratory environments

Implementation Considerations

When deploying V-JEPA 2 for robotics applications, consider these factors:

The model requires minimal robot interaction data for adaptation
Zero-shot deployment works for basic manipulation tasks
Performance may vary with object types and environmental conditions
Integration with existing robotic systems requires careful planning

Next Steps

V-JEPA 2 demonstrates how self-supervised learning from web-scale video data can create versatile world models. Future research should explore expanding the range of robotic tasks, improving sample efficiency, and scaling to more complex environments.

For robotics practitioners, this work suggests that large-scale video pre-training combined with minimal robot data can enable flexible manipulation capabilities without extensive task-specific engineering.
