MicroGPT: A 200-Line Implementation of GPT from Scratch

Andrej Karpathy has created MicroGPT, a complete GPT implementation in just 200 lines of pure Python with zero dependencies. This single file contains everything needed to train and run a language model: dataset handling, tokenization, autograd engine, transformer architecture, and training loop.

What MicroGPT Contains

MicroGPT demonstrates the complete algorithmic essence of modern language models. The implementation includes:

  • Dataset processing for 32,000 names
  • Character-level tokenizer mapping text to integers
  • Autograd engine for automatic differentiation
  • GPT-2-style transformer with attention and MLP blocks
  • Adam optimizer with learning rate decay
  • Inference loop for generating new text

The model learns to generate plausible names like “kamon,” “vialan,” and “keylen” after training on real name data.
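The character-level tokenizer in the list above can be sketched in a few lines. This is an illustrative version, not MicroGPT's exact code; the names `stoi`, `itos`, `encode`, and `decode` are assumptions for the sketch:

```python
# Minimal character-level tokenizer: build a vocabulary from the training
# text, then map characters to integer ids and back.
text = "emma olivia ava"                       # stand-in for the names dataset
chars = sorted(set(text))                      # unique characters = vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

def encode(s):
    return [stoi[ch] for ch in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

tokens = encode("emma")
assert decode(tokens) == "emma"                # round-trip is lossless
```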

The Autograd Engine

The Value class implements automatic differentiation from scratch. Each Value wraps a scalar number and tracks how it was computed:

class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data                # scalar value
        self.grad = 0                   # gradient
        self._children = children       # computation graph children
        self._local_grads = local_grads # local derivatives

Operations like addition and multiplication create new Value objects that remember their inputs and local gradients. The backward() method applies the chain rule to compute gradients for all parameters.
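To make the chain-rule mechanics concrete, here is a self-contained sketch of how operations extend the `Value` class shown above. It is simplified relative to MicroGPT (recursive backward instead of an explicit topological sort, two operations only), but the constructor matches the fields listed above:

```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data                 # scalar value
        self.grad = 0                    # accumulated gradient
        self._children = children        # inputs that produced this Value
        self._local_grads = local_grads  # d(output)/d(input) for each child

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self, grad=1.0):
        # Chain rule: add the upstream gradient, then pass it to each child
        # scaled by that child's local derivative.
        self.grad += grad
        for child, local in zip(self._children, self._local_grads):
            child.backward(grad * local)

a, b = Value(2.0), Value(3.0)
c = a * b + a        # c = 2*3 + 2 = 8
c.backward()         # fills a.grad = b + 1 = 4, b.grad = a = 2
```

Because `a` feeds into `c` along two paths (through the product and through the sum), its gradient accumulates contributions from both, which is exactly what the `+=` in `backward()` implements.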

Transformer Architecture

The gpt() function processes one token at a time, building up a key-value cache for attention. The architecture follows GPT-2 with simplifications:

  1. Token and position embeddings convert discrete tokens to vectors
  2. Multi-head attention lets tokens communicate with previous positions
  3. MLP blocks perform local computation at each position
  4. Residual connections enable gradient flow through deep networks
  5. RMSNorm stabilizes activations

The model has 4,192 parameters compared to billions in production models, but uses identical algorithms.
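The token-at-a-time attention with a key-value cache can be sketched as follows. Plain Python lists stand in for MicroGPT's `Value` objects, and the names (`attend`, `keys`, `values`) are illustrative, not the file's exact API; the point is that each new token scores its query against every cached key and reads a weighted sum of cached values:

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(q, keys, values):
    # Scaled dot-product scores of the query against every cached key.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in keys]
    weights = softmax(scores)
    # Weighted sum of cached values: the token "reads" from prior positions.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

keys, values = [], []    # the KV cache grows by one entry per processed token
for q, k, v in [([1.0, 0.0], [1.0, 0.0], [0.5, 0.5]),
                ([0.0, 1.0], [0.0, 1.0], [1.0, 0.0])]:
    keys.append(k)
    values.append(v)
    out = attend(q, keys, values)
```

Because the cache only ever contains previous positions, causality (no peeking at future tokens) falls out of the loop structure for free.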

Training Process

Training follows the standard language modeling approach:

# Forward pass: predict next token at each position
for pos_id in range(n):
    token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
    logits = gpt(token_id, pos_id, keys, values)
    probs = softmax(logits)
    loss_t = -probs[target_id].log()  # cross-entropy loss
    losses.append(loss_t)

# Backward pass: compute gradients
loss = sum(losses) / n
loss.backward()

# Update parameters with Adam: momentum (m) and adaptive learning rates (v)
for i, p in enumerate(params):
    m[i] = beta1 * m[i] + (1 - beta1) * p.grad
    v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
    m_hat = m[i] / (1 - beta1 ** step)  # bias correction
    v_hat = v[i] / (1 - beta2 ** step)
    p.data -= lr * m_hat / (v_hat ** 0.5 + eps)
    p.grad = 0  # reset gradient for the next step

The loss decreases from ~3.3 (random guessing) to ~2.37 over 1,000 training steps.
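After training, the inference loop feeds the model's own samples back in until an end token appears. The sketch below shows only the sampling mechanics: `gpt` here is a placeholder returning fixed logits, and the assumption that id 0 marks end-of-name is illustrative; MicroGPT calls its trained transformer instead:

```python
import math, random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sample(probs):
    # Draw one token id from the categorical distribution.
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def gpt(token_id, pos_id):
    return [0.0, 1.0, 0.5]        # placeholder logits over a 3-token vocab

token_id, generated = 0, []
for pos_id in range(8):           # cap the generated length
    probs = softmax(gpt(token_id, pos_id))
    token_id = sample(probs)
    if token_id == 0:             # assume id 0 is the end-of-name token
        break
    generated.append(token_id)
```

Sampling from the distribution (rather than always taking the argmax) is what gives varied outputs like the generated names quoted earlier.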

From MicroGPT to ChatGPT

MicroGPT contains the algorithmic core of production language models. The path to ChatGPT involves scaling up each component:

  • Data: Trillions of tokens from web pages, books, and code
  • Tokenizer: Subword encoding with ~100K vocabulary
  • Architecture: Hundreds of billions of parameters across 100+ layers
  • Training: Massive GPU clusters running for months
  • Post-training: Supervised fine-tuning and reinforcement learning

The fundamental algorithm remains identical: predict the next token, compute gradients, update parameters.

Running MicroGPT

The complete implementation requires only Python with no dependencies:

python microgpt.py

Training takes about one minute on a laptop. You can experiment by changing the dataset, increasing model size, or training longer. The code demonstrates that the core of modern AI is surprisingly simple—everything else is engineering for scale.

MicroGPT proves that understanding language models doesn’t require complex frameworks or massive compute. The essential algorithms fit in 200 lines of readable Python code.