Building GPT from Scratch: A Complete Guide to Transformer Architecture

ChatGPT has revolutionized AI by demonstrating the power of transformer-based language models. While ChatGPT itself is a massive production system built on models with billions of parameters, you can understand its core architecture by building a simplified version from scratch. This guide walks you through implementing a GPT-style transformer that generates Shakespeare-like text, teaching you the fundamental concepts behind modern language models.

Understanding the Foundation

A language model predicts the next character or token in a sequence. Given “Hello wor”, it should predict “l” as the most likely next character. This simple concept, when scaled up with transformers, creates systems capable of human-like text generation.

We’ll build our model using the “Tiny Shakespeare” dataset—a 1MB file containing all of Shakespeare’s works. Our character-level model will learn patterns in this text and generate new Shakespeare-like passages.

Setting Up the Data Pipeline

First, create a simple character-level tokenizer:

# Read the dataset
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Get all unique characters
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Create character mappings
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Encoder and decoder functions
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

This creates a vocabulary of 65 unique characters and provides functions to convert between text and integer sequences.
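
As a quick sanity check, we can round-trip a short string through the tokenizer and then encode the entire dataset as a tensor. The sketch below assumes the encode, decode, and text variables defined above; the 90/10 train/validation split is a conventional choice used here for illustration, not a requirement:

import torch

# Round-trip a short string through the tokenizer
ids = encode("hii there")
print(ids)            # a list of small integers, one per character
print(decode(ids))    # 'hii there'

# Encode the whole dataset and hold out the last 10% for validation
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

The train_data and val_data tensors are what the batching and evaluation helpers later in this guide draw from.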

Building the Bigram Baseline

Start with the simplest possible language model—a bigram model that predicts the next character based only on the current character:

import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.cross_entropy below

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, C)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

This model uses an embedding table in which each character's index looks up a row of logits over the next character. While simple, it establishes our training framework.
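
To actually train it, we need random (input, target) batches and an optimization loop. The sketch below assumes the train_data tensor from the split above; the batch size, context length, learning rate, and choice of AdamW are illustrative values for this baseline rather than anything mandated by the model:

batch_size = 32  # independent sequences processed in parallel (illustrative value)
block_size = 8   # maximum context length for the baseline (illustrative value)

def get_batch(data):
    # Sample random starting offsets, then stack the chunks and their shifted-by-one targets
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

model = BigramLanguageModel(vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(10000):
    xb, yb = get_batch(train_data)
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())  # should fall well below the ~ln(65) ≈ 4.17 loss of a uniform guess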

Implementing Self-Attention

The key innovation in transformers is self-attention, which allows tokens to communicate with each other. Here’s how it works:

The Mathematical Trick

Before diving into attention, understand this crucial operation. To make tokens communicate, we need each token to aggregate information from previous tokens. The naive approach uses loops:

# Inefficient way - for illustration only (x has shape (B, T, C))
x_bow = torch.zeros_like(x)
for t in range(T):
    x_prev = x[:, :t+1]                      # all tokens up to and including position t
    x_bow[:, t] = torch.mean(x_prev, dim=1)  # average them over the time dimension

The efficient way uses matrix multiplication with a lower triangular matrix:

# Efficient batched approach
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)  # Normalize to get averages
out = wei @ x  # Batched matrix multiply
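
The same averaging can also be written as a softmax over a masked score matrix, which is exactly the form self-attention will take. With all-zero scores the result is still a uniform average over the past, but once the scores become data-dependent (as they do in the next section), each token can weight its history differently. This uses the same x and T as the snippets above:

import torch
import torch.nn.functional as F

tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)                           # attention "scores", all zero for now
wei = wei.masked_fill(tril == 0, float('-inf'))   # future positions are blocked out
wei = F.softmax(wei, dim=-1)                      # each row sums to 1: uniform averages again
out = wei @ x                                     # identical result to the version above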

Single Head of Self-Attention

Now implement the core self-attention mechanism:

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        # n_embd (embedding dimension) and block_size (max context length) are global hyperparameters
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask stored as a buffer: it moves with the module but is not a trained parameter
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        
    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        
        # Compute attention scores
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5  # (B, T, T), scaled by 1/sqrt(head_size)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        
        # Apply attention to values
        v = self.value(x)  # (B, T, head_size)
        out = wei @ v      # (B, T, head_size)
        return out

Key concepts:

  • Query: “What am I looking for?”
  • Key: “What do I contain?”
  • Value: “What information do I actually communicate?”
  • Scaling: Divide the scores by √(head_size) so large dot products don't saturate the softmax into near one-hot weights
  • Masking: Prevent each position from attending to future positions, preserving the autoregressive left-to-right structure

Multi-Head Attention

Run multiple attention heads in parallel and concatenate their outputs:

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)  # project the concatenated heads back into the residual stream
        
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, num_heads * head_size)
        out = self.proj(out)
        return out

Multiple heads allow the model to attend to different types of patterns simultaneously—one head might focus on syntax while another focuses on semantics.
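
A quick shape check makes the bookkeeping concrete: when n_embd is divisible by the number of heads, concatenating the per-head outputs restores the embedding dimension before the output projection. The small values below are illustrative only (the real hyperparameters appear in the training section), and they are defined as globals because the Head class above reads n_embd and block_size from module scope:

import torch
import torch.nn.functional as F

n_embd, block_size = 32, 8   # illustrative values, read as globals by the classes above
n_head = 4
head_size = n_embd // n_head

x = torch.randn(4, block_size, n_embd)       # (B, T, C)
mha = MultiHeadAttention(n_head, head_size)
print(mha(x).shape)                          # torch.Size([4, 8, 32]), i.e. (B, T, n_embd)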

Building the Complete Transformer Block

Combine attention with feed-forward processing:

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        
    def forward(self, x):
        x = x + self.sa(self.ln1(x))      # Attention with residual connection
        x = x + self.ffwd(self.ln2(x))    # Feed-forward with residual connection
        return x

Critical optimizations:

  • Residual connections: Allow gradients to flow directly through the network
  • Layer normalization: Stabilize training by normalizing features
  • Feed-forward expansion: The inner layer is 4x larger, providing computational capacity

The Complete GPT Model

Assemble everything into the final model:

class GPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)
        
    def forward(self, idx, targets=None):
        B, T = idx.shape
        
        # Embeddings
        tok_emb = self.token_embedding_table(idx)  # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T, C)
        x = tok_emb + pos_emb  # (B, T, C)
        
        # Transformer blocks
        x = self.blocks(x)  # (B, T, C)
        x = self.ln_f(x)    # (B, T, C)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
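
The forward pass only scores text it is given; producing new text is a separate autoregressive loop: crop the context to the last block_size tokens, take the logits at the final position, sample a token, append it, and repeat. The helper below is a minimal sketch of that loop (the function name and the use of multinomial sampling are illustrative choices in the spirit of typical GPT sampling code, not something fixed by the architecture):

@torch.no_grad()
def generate(model, idx, max_new_tokens):
    # idx is a (B, T) tensor of token indices acting as the prompt
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]          # never feed more than block_size tokens
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :]                # keep only the last time step: (B, vocab_size)
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)  # sample the next token
        idx = torch.cat((idx, idx_next), dim=1)  # append it and continue
    return idx

# Example usage (an untrained model will produce noise until it has been trained)
model = GPT()
context = torch.zeros((1, 1), dtype=torch.long)
print(decode(generate(model, context, max_new_tokens=500)[0].tolist()))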

Training and Results

With proper hyperparameters:

  • Batch size: 64
  • Block size: 256 characters of context
  • Embedding dimension: 384
  • Number of heads: 6
  • Number of layers: 6
  • Learning rate: 3e-4

The model achieves a validation loss of approximately 1.48 and generates coherent Shakespeare-like text:

DUKE OF AUMERLE:
Madam, I would not be so bold to say
That I am guiltless of your majesty's
Displeasure; but I hope your grace will pardon
My rashness, and accept my humble suit.
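
Single-batch training losses are noisy, so a validation number like the one quoted above is best tracked by averaging the loss over many batches from each split with the model in eval mode. Below is a sketch of such an evaluation helper; it assumes the get_batch function and data splits from earlier, with batch_size and block_size updated to the values listed above, and the eval_iters count is an illustrative choice:

eval_iters = 200  # batches to average per split (illustrative value)

@torch.no_grad()
def estimate_loss(model):
    out = {}
    model.eval()  # disable training-only behavior such as dropout, if added later
    for name, split in [('train', train_data), ('val', val_data)]:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[name] = losses.mean()
    model.train()
    return out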

Key Insights

Attention as Communication: Self-attention creates a communication mechanism where tokens can selectively gather information from other tokens based on their content.

Scalability: This architecture scales remarkably well. GPT-3 uses essentially the same structure but with 175 billion parameters instead of our 10 million.

Decoder-Only Design: Unlike the original transformer paper (which used encoder-decoder for translation), GPT uses only the decoder portion for autoregressive text generation.

Position Matters: Since attention operates on sets of vectors, positional embeddings are crucial for the model to understand sequence order.

From GPT to ChatGPT

Our model generates text but doesn’t follow instructions. ChatGPT adds:

  1. Massive scale: 175B+ parameters trained on hundreds of billions of tokens
  2. Instruction tuning: Fine-tuning on question-answer pairs
  3. Human feedback: Training a reward model from human preferences
  4. Reinforcement learning: Using PPO to optimize for human-preferred responses

Next Steps

You now understand the core architecture behind modern language models. To go further:

  • Experiment with different hyperparameters
  • Try training on different datasets
  • Implement techniques like dropout for regularization
  • Explore fine-tuning for specific tasks
  • Study the latest developments in transformer architectures

The complete code is available in the nanoGPT repository, providing a clean, educational implementation of these concepts. With this foundation, you’re ready to explore the cutting edge of language model research and development.