MicroGPT: A 200-Line Implementation of GPT from Scratch

Andrej Karpathy has created MicroGPT, a complete GPT implementation in just 200 lines of pure Python with zero dependencies. This single file contains everything needed to train and run a language model: dataset handling, tokenization, autograd engine, transformer architecture, and training loop.

What MicroGPT Contains

MicroGPT demonstrates the complete algorithmic essence of modern language models. The implementation includes:

  • Dataset processing for 32,000 names
  • Character-level tokenizer mapping text to integers
  • Autograd engine for automatic differentiation
  • GPT-2-style transformer with attention and MLP blocks
  • Adam optimizer with learning rate decay
  • Inference loop for generating new text

The model learns to generate plausible names like “kamon,” “vialan,” and “keylen” after training on real name data.
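The character-level tokenizer in the list above can be sketched in a few lines. This is an illustrative version, not MicroGPT's exact code; the names `stoi`, `itos`, `encode`, and `decode` are assumptions for the sketch:

```python
# Minimal character-level tokenizer: build a vocabulary from the training
# text, then map characters to integer ids and back.
text = "emma olivia ava"                       # stand-in for the names dataset
chars = sorted(set(text))                      # unique characters = vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

def encode(s):
    return [stoi[ch] for ch in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

tokens = encode("emma")
assert decode(tokens) == "emma"                # round-trip is lossless
```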

The Autograd Engine

The Value class implements automatic differentiation from scratch. Each Value wraps a scalar number and tracks how it was computed:

class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data                # scalar value
        self.grad = 0                   # gradient
        self._children = children       # computation graph children
        self._local_grads = local_grads # local derivatives

Operations like addition and multiplication create new Value objects that remember their inputs and local gradients. The backward() method applies the chain rule to compute gradients for all parameters.
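To make the chain-rule mechanics concrete, here is a self-contained sketch of how operations extend the `Value` class shown above. It is simplified relative to MicroGPT (recursive backward instead of an explicit topological sort, two operations only), but the constructor matches the fields listed above:

```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data                 # scalar value
        self.grad = 0                    # accumulated gradient
        self._children = children        # inputs that produced this Value
        self._local_grads = local_grads  # d(output)/d(input) for each child

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self, grad=1.0):
        # Chain rule: add the upstream gradient, then pass it to each child
        # scaled by that child's local derivative.
        self.grad += grad
        for child, local in zip(self._children, self._local_grads):
            child.backward(grad * local)

a, b = Value(2.0), Value(3.0)
c = a * b + a        # c = 2*3 + 2 = 8
c.backward()         # fills a.grad = b + 1 = 4, b.grad = a = 2
```

Because `a` feeds into `c` along two paths (through the product and through the sum), its gradient accumulates contributions from both, which is exactly what the `+=` in `backward()` implements.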

Transformer Architecture

The gpt() function processes one token at a time, building up a key-value cache for attention. The architecture follows GPT-2 with simplifications:

  1. Token and position embeddings convert discrete tokens to vectors
  2. Multi-head attention lets tokens communicate with previous positions
  3. MLP blocks perform local computation at each position
  4. Residual connections enable gradient flow through deep networks
  5. RMSNorm stabilizes activations

The model has 4,192 parameters compared to billions in production models, but uses identical algorithms.
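The token-at-a-time attention with a key-value cache can be sketched as follows. Plain Python lists stand in for MicroGPT's `Value` objects, and the names (`attend`, `keys`, `values`) are illustrative, not the file's exact API; the point is that each new token scores its query against every cached key and reads a weighted sum of cached values:

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(q, keys, values):
    # Scaled dot-product scores of the query against every cached key.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in keys]
    weights = softmax(scores)
    # Weighted sum of cached values: the token "reads" from prior positions.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

keys, values = [], []    # the KV cache grows by one entry per processed token
for q, k, v in [([1.0, 0.0], [1.0, 0.0], [0.5, 0.5]),
                ([0.0, 1.0], [0.0, 1.0], [1.0, 0.0])]:
    keys.append(k)
    values.append(v)
    out = attend(q, keys, values)
```

Because the cache only ever contains previous positions, causality (no peeking at future tokens) falls out of the loop structure for free.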

Training Process

Training follows the standard language modeling approach:

# Forward pass: predict next token at each position
for pos_id in range(n):
    token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
    logits = gpt(token_id, pos_id, keys, values)
    probs = softmax(logits)
    loss_t = -probs[target_id].log()  # cross-entropy loss
    losses.append(loss_t)

# Backward pass: compute gradients
loss = sum(losses) / n
loss.backward()

# Update parameters with Adam: momentum (m) and adaptive learning rates (v)
for i, p in enumerate(params):
    m[i] = beta1 * m[i] + (1 - beta1) * p.grad
    v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
    m_hat = m[i] / (1 - beta1 ** step)  # bias correction
    v_hat = v[i] / (1 - beta2 ** step)
    p.data -= lr * m_hat / (v_hat ** 0.5 + eps)
    p.grad = 0  # reset gradient for the next step

The loss decreases from ~3.3 (random guessing) to ~2.37 over 1,000 training steps.
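After training, the inference loop feeds the model's own samples back in until an end token appears. The sketch below shows only the sampling mechanics: `gpt` here is a placeholder returning fixed logits, and the assumption that id 0 marks end-of-name is illustrative; MicroGPT calls its trained transformer instead:

```python
import math, random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sample(probs):
    # Draw one token id from the categorical distribution.
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def gpt(token_id, pos_id):
    return [0.0, 1.0, 0.5]        # placeholder logits over a 3-token vocab

token_id, generated = 0, []
for pos_id in range(8):           # cap the generated length
    probs = softmax(gpt(token_id, pos_id))
    token_id = sample(probs)
    if token_id == 0:             # assume id 0 is the end-of-name token
        break
    generated.append(token_id)
```

Sampling from the distribution (rather than always taking the argmax) is what gives varied outputs like the generated names quoted earlier.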

From MicroGPT to ChatGPT

MicroGPT contains the algorithmic core of production language models. The path to ChatGPT involves scaling up each component:

  • Data: Trillions of tokens from web pages, books, and code
  • Tokenizer: Subword encoding with ~100K vocabulary
  • Architecture: Hundreds of billions of parameters across 100+ layers
  • Training: Massive GPU clusters running for months
  • Post-training: Supervised fine-tuning and reinforcement learning

The fundamental algorithm remains identical: predict the next token, compute gradients, update parameters.

Running MicroGPT

The complete implementation requires only Python with no dependencies:

python microgpt.py

Training takes about one minute on a laptop. You can experiment by changing the dataset, increasing model size, or training longer. The code demonstrates that the core of modern AI is surprisingly simple—everything else is engineering for scale.

MicroGPT proves that understanding language models doesn’t require complex frameworks or massive compute. The essential algorithms fit in 200 lines of readable Python code.