LeWorldModel: Stable End-to-End World Model Learning from Pixels

LeWorldModel is a Joint-Embedding Predictive Architecture (JEPA) for learning world models from raw pixels. It achieves stable end-to-end training using only two loss terms, dramatically simplifying training while delivering competitive performance.

The Collapse Problem in World Models

World models learn to predict future states by encoding observations into compact representations and modeling dynamics in this latent space. However, existing methods face a critical challenge: representation collapse. The model can trivially satisfy prediction objectives by mapping all inputs to identical representations, rendering the learned model useless.

Current solutions rely on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision. These approaches introduce training instability and require extensive hyperparameter tuning; PLDM, for example, uses six tunable loss coefficients, requiring a costly search over a six-dimensional coefficient grid.

LeWorldModel’s Simple Solution

LeWorldModel solves collapse with remarkable simplicity. The training objective combines just two terms:

  1. Prediction Loss: Standard mean-squared error between predicted and actual next-state embeddings
  2. SIGReg Regularization: Enforces Gaussian-distributed latent embeddings using statistical normality tests

This reduces the tunable hyperparameters from six to one, turning hyperparameter search into a simple one-dimensional sweep.
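The two-term objective above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's code: `prediction_loss`, `sigreg_loss`, `total_loss`, and the coefficient `lam` are hypothetical names, and the SIGReg term here uses a simple moment-matching penalty on random projections as a stand-in for the statistical normality tests the method actually applies.

```python
import numpy as np

def prediction_loss(pred_next, true_next):
    """Mean-squared error between predicted and actual next-state embeddings."""
    return np.mean((pred_next - true_next) ** 2)

def sigreg_loss(z, num_projections=16, rng=None):
    """Stand-in for SIGReg: project embeddings (batch, d) onto random unit
    directions and penalize each 1-D projection for deviating from a standard
    Gaussian, here via its first two moments (an assumption; the real
    regularizer uses statistical normality tests)."""
    rng = np.random.default_rng(rng)
    dirs = rng.normal(size=(num_projections, z.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj = z @ dirs.T                          # (batch, num_projections)
    mean_pen = np.mean(proj.mean(axis=0) ** 2)           # mean should be 0
    var_pen = np.mean((proj.var(axis=0) - 1.0) ** 2)     # variance should be 1
    return mean_pen + var_pen

def total_loss(pred_next, true_next, z, lam=1.0):
    # lam is the single tunable coefficient mentioned above
    return prediction_loss(pred_next, true_next) + lam * sigreg_loss(z, rng=0)
```

Note that a collapsed batch (all embeddings identical) has zero projection variance, so the stand-in regularizer penalizes it even though the prediction loss would be trivially low.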

The SIGReg Innovation

SIGReg prevents collapse by encouraging latent embeddings to match an isotropic Gaussian distribution. The method projects embeddings onto random directions and applies univariate normality tests to each projection. By the Cramér-Wold theorem, matching all one-dimensional marginals ensures the full joint distribution matches the target Gaussian.

This approach provides theoretical guarantees against collapse while remaining computationally efficient and stable across different architectures.
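The project-then-test idea can be sketched as follows. As an assumption (the source does not specify the exact test), each 1-D projection is scored with an Epps-Pulley-style statistic: the squared distance between the projection's empirical characteristic function and that of N(0,1), evaluated on a small grid rather than as a weighted integral.

```python
import numpy as np

def ecf_normality_stat(x, ts=np.linspace(-3, 3, 13)):
    """Epps-Pulley-style statistic for a 1-D sample x: squared distance
    between its empirical characteristic function and the (real-valued)
    characteristic function of N(0,1), on a fixed grid of frequencies."""
    re = np.array([np.mean(np.cos(t * x)) for t in ts])
    im = np.array([np.mean(np.sin(t * x)) for t in ts])
    target = np.exp(-ts ** 2 / 2)              # CF of N(0,1)
    return np.mean((re - target) ** 2 + im ** 2)

def sigreg(z, num_dirs=32, seed=0):
    """Average the 1-D statistic over random unit directions. By the
    Cramér-Wold theorem, driving every 1-D projection toward N(0,1)
    drives the joint distribution toward an isotropic Gaussian."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(num_dirs, z.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return float(np.mean([ecf_normality_stat(z @ u) for u in dirs]))
```

A Gaussian batch scores near zero, while a collapsed batch (constant embeddings, whose characteristic function is 1 everywhere) scores much higher, which is exactly the anti-collapse pressure described above.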

Performance and Efficiency

LeWorldModel delivers impressive results across diverse 2D and 3D control tasks:

  • Planning Speed: 48× faster than foundation-model-based approaches
  • Model Size: Compact 15M parameters trainable on a single GPU
  • Performance: Outperforms existing end-to-end methods, achieving 18% higher success rates on challenging manipulation tasks
  • Training Stability: Smooth, monotonic convergence compared to noisy multi-term objectives

The method works across navigation, manipulation, and locomotion tasks in both 2D and 3D environments, demonstrating broad applicability.

Physical Understanding

Beyond control performance, LeWorldModel learns meaningful physical representations. Probing experiments show the latent space encodes:

  • Object positions and orientations
  • Agent locations and velocities
  • Physical relationships between entities

Violation-of-expectation tests confirm the model reliably detects physically implausible events, assigning higher surprise to teleportation violations than visual perturbations.

Remarkably, temporal straightening emerges naturally during training—latent trajectories become increasingly linear over time without explicit regularization, a property linked to effective planning.
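One simple way to quantify such straightening is the mean cosine similarity between consecutive displacement vectors of a latent trajectory; a value of 1.0 means a perfectly straight path. This metric is an illustrative choice, not necessarily the one used in the source.

```python
import numpy as np

def straightness(traj, eps=1e-8):
    """Mean cosine similarity between consecutive displacement vectors
    of a latent trajectory with shape (T, d). Returns ~1.0 for a straight
    path and negative values for a path that keeps reversing direction."""
    deltas = np.diff(traj, axis=0)                     # (T-1, d) displacements
    norms = np.linalg.norm(deltas, axis=1)
    cos = np.sum(deltas[:-1] * deltas[1:], axis=1) / (norms[:-1] * norms[1:] + eps)
    return float(np.mean(cos))
```

Tracking this quantity over training checkpoints would reveal the straightening trend described above.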

Implementation Details

LeWorldModel uses a Vision Transformer encoder (5M parameters) and transformer predictor (10M parameters). The encoder maps 224×224 pixel observations to compact latent representations, while the predictor models dynamics by predicting future embeddings conditioned on actions.

Training requires no stop-gradient operations, exponential moving averages, or architectural tricks. All parameters optimize jointly end-to-end using standard gradient descent.
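The joint end-to-end setup can be illustrated with a deliberately tiny stand-in: linear maps in place of the ViT encoder and transformer predictor, a mean/variance penalty in place of SIGReg, and finite-difference gradients in place of backprop so the demo has no framework dependency. Every name and number here is illustrative; the point is only that one loss, with no stop-gradients or moving averages, trains encoder and predictor together.

```python
import numpy as np

rng = np.random.default_rng(0)
W_obs = rng.normal(size=(6, 1))                  # fake "pixel" feature map
states = rng.normal(size=(64, 1))                # ground-truth 1-D states
actions = 0.1 * rng.normal(size=(64, 1))
obs_t = states @ W_obs.T                         # observations at time t
obs_tp1 = (states + actions) @ W_obs.T           # observations at time t+1

D = 2                                            # latent dimension
def unpack(w):
    E = w[:6 * D].reshape(6, D)                  # encoder weights
    P = w[6 * D:].reshape(D + 1, D)              # predictor weights
    return E, P

def loss(w, lam=0.1):
    E, P = unpack(w)
    z_t, z_tp1 = obs_t @ E, obs_tp1 @ E
    pred = np.hstack([z_t, actions]) @ P         # action-conditioned prediction
    mse = np.mean((pred - z_tp1) ** 2)
    # stand-in anti-collapse term: latent mean near 0, variance near 1
    reg = np.mean(z_t.mean(0) ** 2) + np.mean((z_t.var(0) - 1.0) ** 2)
    return mse + lam * reg                       # one scalar; no stop-gradients

def grad(f, w, h=1e-5):                          # central finite differences
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

w = 0.1 * rng.normal(size=6 * D + (D + 1) * D)
losses = [loss(w)]
for _ in range(300):                             # plain gradient descent
    w -= 0.02 * grad(loss, w)
    losses.append(loss(w))
```

Because both terms are differentiable in all parameters, a single optimizer step updates encoder and predictor simultaneously, mirroring the joint end-to-end training described above.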

For planning, the method performs trajectory optimization in latent space using the Cross-Entropy Method, optimizing action sequences to minimize distance to goal embeddings.
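The Cross-Entropy Method loop can be sketched as follows: sample action sequences from a Gaussian, roll them through the latent dynamics, score each by distance to the goal embedding, and refit the Gaussian to the best samples. The function signature and hyperparameters are assumptions for illustration.

```python
import numpy as np

def cem_plan(dynamics, z0, z_goal, horizon=5, act_dim=1,
             pop=64, elites=8, iters=10, seed=0):
    """Cross-Entropy Method over action sequences in latent space.
    dynamics(z, a) -> next latent; returns the mean action sequence."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, act_dim))            # sampling distribution mean
    sigma = np.ones((horizon, act_dim))          # ...and std per step/dim
    for _ in range(iters):
        acts = mu + sigma * rng.normal(size=(pop, horizon, act_dim))
        costs = np.empty(pop)
        for i in range(pop):
            z = z0
            for t in range(horizon):             # roll out latent dynamics
                z = dynamics(z, acts[i, t])
            costs[i] = np.sum((z - z_goal) ** 2)  # distance to goal embedding
        elite = acts[np.argsort(costs)[:elites]]  # keep the best sequences
        mu, sigma = elite.mean(0), elite.std(0) + 1e-6
    return mu
```

With a toy additive dynamics `dynamics(z, a) = z + a`, the planned actions should sum to approximately the goal displacement, which makes the optimizer easy to sanity-check.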

Key Advantages

LeWorldModel offers several compelling benefits:

  • Simplicity: Two-term objective versus complex multi-term losses
  • Stability: Provable anti-collapse guarantees and smooth training
  • Efficiency: Single GPU training and fast planning
  • Generality: Works across diverse environments and architectures
  • Interpretability: Clear physical structure in learned representations

Getting Started

The method provides a practical alternative to existing world model approaches. With minimal hyperparameter tuning and stable training dynamics, LeWorldModel lowers barriers to world model research while delivering competitive performance.

The approach particularly benefits researchers seeking:

  • Stable end-to-end training from pixels
  • Fast planning for real-time applications
  • Interpretable learned representations
  • Simple, principled training objectives

LeWorldModel demonstrates that effective world model learning need not require complex training procedures—sometimes the simplest solutions prove most powerful.