Scaling Latent Reasoning via Looped Language Models

Large language models are traditionally scaled by increasing parameters, data, and compute. This paper introduces a further scaling dimension: iterative computation depth through parameter reuse. The authors present Ouro, a family of Looped Language Models (LoopLMs) that achieve 2-3× parameter efficiency by implementing recurrent computation with shared weights.

The Core Innovation

LoopLMs apply the same transformer layers multiple times in sequence, creating deeper computation without additional parameters. Unlike chain-of-thought reasoning that extends output sequences, LoopLMs deepen internal processing while maintaining fixed context length.

The architecture includes:

  • Shared transformer blocks applied T times recurrently
  • Adaptive exit gates that learn when to stop iterating
  • Entropy-regularized training to prevent collapse to single depths
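The looped forward pass can be sketched in a few lines. This is an illustrative toy, not Ouro's implementation: the shared block is stubbed as a single residual matmul, and `exit_gate`, `looped_forward`, and the 0.5 threshold are hypothetical names and choices; the point is that the same weights are applied up to T times and a learned gate can stop early.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_block(h, W):
    """One shared transformer block, stubbed as a residual tanh matmul."""
    return h + np.tanh(h @ W)

def exit_gate(h, w_gate):
    """Scalar exit probability: sigmoid of a linear readout of the mean state."""
    logit = float(np.mean(h @ w_gate))
    return 1.0 / (1.0 + np.exp(-logit))

def looped_forward(h, W, w_gate, max_steps=4, threshold=0.5):
    """Apply the SAME block up to max_steps times; stop once the gate fires."""
    for t in range(1, max_steps + 1):
        h = shared_block(h, W)  # same weights W at every iteration
        if t < max_steps and exit_gate(h, w_gate) > threshold:
            return h, t         # adaptive early exit
    return h, max_steps

d = 8
h0 = rng.normal(size=(3, d))               # 3 tokens, hidden size 8
W = rng.normal(scale=0.1, size=(d, d))
w_gate = rng.normal(scale=0.1, size=d)
h_out, steps = looped_forward(h0, W, w_gate)
print(steps)                                # recurrent steps actually used
```

Depth is thus a runtime knob: the block count in memory is fixed, while effective depth varies per input.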

Key Results

The 1.4B and 2.6B Ouro models match the performance of 4B and 8B standard transformers across reasoning benchmarks:

  • MMLU-Pro: Ouro-2.6B achieves 55.73 vs 53.72 for Qwen3-8B
  • BBH: Ouro-2.6B reaches 80.46 vs 77.65 for Qwen3-8B
  • MATH500: Ouro-1.4B scores 82.40 vs 59.60 for Qwen3-4B

Performance scales predictably with recurrent depth, peaking around the trained maximum of 4 steps.

Training Methodology

The training uses a two-stage approach:

Stage I optimizes an entropy-regularized objective:

L = Σ_t p(t|x) L^(t) − β H(p(·|x))

where p(t|x) is the learned exit distribution and L^(t) is loss at step t.
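Concretely, the objective is an expected per-step loss under the exit distribution, minus an entropy bonus that keeps p(t|x) from collapsing onto a single depth. A minimal sketch (the function name, logit parameterization, and β value are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def stage1_loss(step_losses, exit_logits, beta=0.01):
    """Entropy-regularized Stage I objective (sketch):
    sum_t p(t|x) L^(t) - beta * H(p(.|x))."""
    z = np.exp(exit_logits - exit_logits.max())
    p = z / z.sum()                                  # p(t|x): softmax over steps
    expected_loss = float(np.dot(p, step_losses))    # sum_t p(t|x) L^(t)
    entropy = float(-np.sum(p * np.log(p + 1e-12)))  # H(p(.|x))
    return expected_loss - beta * entropy

losses = np.array([2.0, 1.5, 1.2, 1.1])  # L^(t) for t = 1..4
logits = np.zeros(4)                      # uniform exit distribution
print(round(stage1_loss(losses, logits), 4))  # → 1.4361
```

With uniform p, the expected loss is 1.45 and the entropy term subtracts β·ln 4 ≈ 0.0139; gradients on the logits trade loss reduction against keeping the exit distribution spread out.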

Stage II fine-tunes exit gates using performance improvement signals, teaching the model when additional computation helps.
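One simple way to read "performance improvement signals" is as binary continue/exit labels derived from whether another loop step actually lowers the loss. The formulation below is a hypothetical sketch (the `continue_targets` helper and `min_gain` margin are my own names, not the paper's):

```python
def continue_targets(step_losses, min_gain=0.0):
    """Label step t 'continue' (1) when one more loop step reduces the loss
    by more than min_gain, else 'exit' (0); the gate is fine-tuned toward
    these labels (illustrative sketch, not the paper's exact recipe)."""
    return [1 if step_losses[t + 1] < step_losses[t] - min_gain else 0
            for t in range(len(step_losses) - 1)]

# Loss keeps dropping at step 2, then plateaus and regresses:
print(continue_targets([2.0, 1.5, 1.49, 1.6], min_gain=0.05))  # → [1, 0, 0]
```

The margin matters: a step that improves loss only marginally (1.50 → 1.49 above) is labeled "exit", so the gate learns to spend extra computation only where it pays off.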

Understanding the Mechanism

Controlled experiments reveal that LoopLMs don't increase knowledge storage capacity (both looped and standard models achieve ~2 bits per parameter). Instead, they excel at knowledge manipulation: composing facts and performing multi-step reasoning.

On synthetic tasks requiring knowledge composition:

  • Mano arithmetic: LoopLMs outperform iso-parameter baselines
  • Multi-hop QA: LoopLMs learn with fewer training examples
  • MMLU analysis: Largest gains appear in reasoning-heavy categories (logic, math) rather than knowledge-heavy ones (facts, trivia)

Practical Benefits

Inference Efficiency: KV cache sharing during decoding reduces memory by 4× with minimal performance loss.
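The 4× figure follows from simple accounting: without sharing, each of the four loop iterations would store its own K/V entries; with sharing, one copy serves all iterations. A rough back-of-envelope (the function and the 24-layer/16-head/128-dim configuration are illustrative assumptions, not Ouro's actual layout):

```python
def kv_cache_bytes(layers, loops, seq_len, heads, head_dim,
                   bytes_per_elem=2, shared=True):
    """Rough KV-cache size for a looped model. Without sharing, every
    loop iteration keeps its own K/V; with sharing, one copy serves all."""
    per_layer = 2 * seq_len * heads * head_dim * bytes_per_elem  # K and V
    effective_layers = layers if shared else layers * loops
    return per_layer * effective_layers

no_share = kv_cache_bytes(24, 4, 4096, 16, 128, shared=False)
share = kv_cache_bytes(24, 4, 4096, 16, 128, shared=True)
print(no_share // share)  # → 4
```

The reduction factor equals the loop count regardless of the other dimensions, which is why a trained maximum of 4 steps yields the reported 4× memory saving.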

Safety: Model safety improves with additional recurrent steps, even when extrapolating beyond training depth.

Faithfulness: Unlike chain-of-thought, LoopLM's latent reasoning shows genuine decision revision across steps rather than post-hoc rationalization.

Implementation Details

The models use standard transformer architecture with:

  • RoPE positional embeddings
  • SwiGLU activations
  • Sandwich normalization for stability
  • 49,152-token vocabulary

Training spans 7.7T tokens across four stages, progressing from web data to high-quality reasoning datasets.

Implications

This work establishes recurrent depth as a viable third scaling axis beyond parameters and data. The approach offers particular value for deployment scenarios requiring parameter efficiency while maintaining reasoning capability.

The results suggest that architectural innovation through parameter reuse can achieve scaling benefits traditionally requiring larger models, opening new directions for efficient language model development.