Building Character-Level Language Models: From Bigrams to GPT-2

A step-by-step tutorial on building character-level language models, starting with simple bigram models and progressing toward transformer architectures.

Character-level language models predict the next character in a sequence by learning patterns from training data. This tutorial builds these models step-by-step, starting with simple bigram counting and progressing to neural network implementations that form the foundation of modern transformers.

Understanding Character-Level Language Models

A character-level language model treats text as sequences of individual characters. For the name “REESE”, the model sees the sequence R-E-E-S-E and learns to predict each next character given the previous context.
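
Concretely, a single name already contains several next-character prediction problems. A minimal sketch for “REESE”:

# The next-character prediction problems inside the name "REESE"
word = 'REESE'
for i in range(1, len(word)):
    context, target = word[:i], word[i]
    print(f"given '{context}' -> predict '{target}'")
# given 'R' -> predict 'E', given 'RE' -> predict 'E', ..., given 'REES' -> predict 'E'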

These models excel at generating text that follows learned patterns. Train one on a dataset of names, and it generates new, name-like sequences that sound plausible but mostly do not appear in the original data.

Building a Bigram Model

The simplest character-level model is a bigram model, which predicts the next character using only the immediately preceding character.

Preparing the Data

Start with a dataset of names and extract all character pairs (bigrams):

import torch
import matplotlib.pyplot as plt

# Load names dataset
words = open('names.txt', 'r').read().splitlines()

# Create bigrams with special start/end tokens
bigrams = []
for word in words:
    chars = ['.'] + list(word) + ['.']  # . marks start/end
    for ch1, ch2 in zip(chars, chars[1:]):
        bigrams.append((ch1, ch2))

Counting and Normalizing

Count bigram frequencies and convert to probabilities:

# Count bigram occurrences
counts = {}
for ch1, ch2 in bigrams:
    counts[(ch1, ch2)] = counts.get((ch1, ch2), 0) + 1

# Build the character vocabulary: the 26 letters plus the '.' start/end token
chars = sorted(list(set(''.join(words))))
chars = ['.'] + chars  # Start token first

# Create lookup tables between characters and integer indices
s2i = {s: i for i, s in enumerate(chars)}
i2s = {i: s for s, i in s2i.items()}

# Build the 27 x 27 count matrix: row = current character, column = next character
N = torch.zeros((27, 27), dtype=torch.int32)
for ch1, ch2 in bigrams:
    ix1, ix2 = s2i[ch1], s2i[ch2]
    N[ix1, ix2] += 1

# Convert to probabilities
P = N.float()
P = P / P.sum(1, keepdim=True)  # Normalize rows
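
With P in hand, you can already inspect what the model has learned, for example the five characters it considers most likely to start a name (row 0 corresponds to the '.' start token):

# Most likely first characters according to the bigram model
top_p, top_ix = P[0].topk(5)
for p, ix in zip(top_p, top_ix):
    print(f"{i2s[ix.item()]}: {p.item():.3f}")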

Sampling from the Model

Generate new names by following the probability distributions:

def sample_bigram(P, i2s, seed=42):
    g = torch.Generator().manual_seed(seed)
    
    name = []
    ix = 0  # Start with '.'
    
    while True:
        p = P[ix]
        ix = torch.multinomial(p, 1, replacement=True, generator=g).item()
        if ix == 0:  # End token
            break
        name.append(i2s[ix])
    
    return ''.join(name)

# Generate samples (vary the seed, otherwise every call returns the same name)
for i in range(5):
    print(sample_bigram(P, i2s, seed=42 + i))

Neural Network Implementation

The counting approach works for bigrams but doesn’t scale to longer contexts: a table indexed by the previous k characters needs 27^k rows, which grows exponentially with k. Neural networks provide a flexible alternative that achieves equivalent results for bigrams while extending naturally to longer contexts.
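
A rough back-of-the-envelope calculation shows how quickly a pure counting table grows as the context gets longer:

# Number of rows in a count table indexed by the previous k characters
vocab_size = 27  # 26 letters plus the '.' token
for k in range(1, 6):
    print(f"context of {k} character(s): {vocab_size ** k:,} rows")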

Creating Training Data

Convert bigrams to input-output pairs for neural network training:

# Prepare training data
xs, ys = [], []
for ch1, ch2 in bigrams:
    xs.append(s2i[ch1])
    ys.append(s2i[ch2])

xs = torch.tensor(xs)
ys = torch.tensor(ys)
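
A quick sanity check decodes the first few training pairs back into characters; each input character should be paired with the character that followed it in the data:

# Decode the first few (input, target) pairs back to characters
for x, y in zip(xs[:3], ys[:3]):
    print(i2s[x.item()], '->', i2s[y.item()])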

One-Hot Encoding

The network cannot consume raw integer indices directly; each character index is first converted to a one-hot vector, a vector of 27 zeros with a single 1 at the character's position:

import torch.nn.functional as F

# Convert integers to one-hot vectors
x_enc = F.one_hot(xs, num_classes=27).float()
print(f"Shape: {x_enc.shape}")  # [num_examples, 27]

Building the Neural Network

Create a single linear layer that maps 27 inputs to 27 outputs:

# Initialize weights randomly
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)

# Forward pass
def forward(x_enc, W):
    logits = x_enc @ W  # Matrix multiplication
    counts = logits.exp()  # Convert log-counts to counts
    probs = counts / counts.sum(1, keepdim=True)  # Normalize
    return probs

probs = forward(x_enc, W)
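
Each row of probs is a probability distribution over the 27 possible next characters, which is easy to check:

# Every row of probs should sum to 1 (up to floating-point error)
print(probs.shape)            # [num_examples, 27]
print(probs[0].sum().item())  # ~1.0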

Training with Gradient Descent

Optimize the weights using negative log-likelihood loss:

# Training loop
learning_rate = 50
for i in range(100):
    # Forward pass
    probs = forward(x_enc, W)
    
    # Calculate loss (negative log-likelihood)
    loss = -probs[torch.arange(len(ys)), ys].log().mean()
    
    # Backward pass
    W.grad = None
    loss.backward()
    
    # Update weights
    W.data += -learning_rate * W.grad
    
    if i % 10 == 0:
        print(f"Step {i}: loss = {loss.item():.4f}")

Understanding the Equivalence

The neural network approach produces identical results to counting because:

  1. One-hot encoding + matrix multiplication = table lookup
  2. Exponentiating logits = converting log-counts to counts
  3. Normalizing = creating probability distributions

The weight matrix W learns to store the same log-counts that the counting method computed directly.
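
The first point is easy to verify with the trained W from above: multiplying a one-hot vector by W simply selects the corresponding row.

# One-hot encoding followed by matrix multiplication is a row lookup
ix = torch.tensor([5])
via_matmul = F.one_hot(ix, num_classes=27).float() @ W
via_indexing = W[5]
print(torch.allclose(via_matmul[0], via_indexing))  # True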

Model Evaluation

Measure model quality using negative log-likelihood on the training set:

def evaluate_loss(probs, ys):
    """Calculate average negative log-likelihood"""
    log_probs = probs[torch.arange(len(ys)), ys].log()
    return -log_probs.mean()

loss = evaluate_loss(probs, ys)
print(f"Training loss: {loss.item():.4f}")

Lower loss indicates better model performance. A model that assigned probability 1 to every correct next character would achieve a loss of 0, though that is unattainable here, since the same context is followed by different next characters across the dataset.
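
A useful reference point is a model that guesses uniformly over the 27 characters; its loss is log(27), and the trained bigram model should come in well below that:

import math

# Loss of a model that assigns probability 1/27 to every character
uniform_loss = -math.log(1 / 27)
print(f"Uniform baseline: {uniform_loss:.4f}")  # ~3.2958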

Model Smoothing and Regularization

Prevent zero probabilities (which make the loss infinite for unseen bigrams) by adding fake counts to the count-based model (smoothing); the analogous fix for the neural network is to penalize large weights (regularization):

# Smoothing: add fake counts
N_smooth = N + 1  # Add 1 to all counts
P_smooth = N_smooth.float() / N_smooth.sum(1, keepdim=True)

# Regularization: encourage small weights
def regularized_loss(probs, ys, W, reg_strength=0.01):
    nll = -probs[torch.arange(len(ys)), ys].log().mean()
    reg = reg_strength * (W**2).mean()
    return nll + reg
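
To apply the regularization, swap the regularized loss into the training loop in place of the plain negative log-likelihood:

# Inside the training loop, replace the loss computation with:
loss = regularized_loss(probs, ys, W, reg_strength=0.01)

Pushing W toward zero makes the softmax output more uniform, which has the same effect as adding larger fake counts in the smoothed count model.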

Sampling from the Neural Network

Generate text using the trained neural network:

def sample_neural(W, i2s, seed=42):
    g = torch.Generator().manual_seed(seed)
    
    name = []
    ix = 0  # Start token
    
    while True:
        # Convert to one-hot and get probabilities
        x_enc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = x_enc @ W
        probs = F.softmax(logits, dim=1)
        
        # Sample next character
        ix = torch.multinomial(probs, 1, generator=g).item()
        if ix == 0:  # End token
            break
        name.append(i2s[ix])
    
    return ''.join(name)
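
The function above is only defined, not called; as with the count-based model, vary the seed between calls so the samples differ:

# Draw a few samples from the trained network
for i in range(5):
    print(sample_neural(W, i2s, seed=42 + i))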

Key Insights

Scalability: The neural network approach extends naturally to longer contexts by changing the input representation, while counting becomes impractical.

Flexibility: Neural networks can incorporate complex architectures (multiple layers, attention mechanisms) while maintaining the same training framework.

Foundation: This simple model demonstrates the core concepts used in modern language models like GPT-2: convert text to vectors, process through neural networks, output probability distributions, and optimize with gradient descent.

Next Steps

This bigram model forms the foundation for more sophisticated architectures:

  • Multi-layer perceptrons: Add hidden layers for more complex patterns
  • Recurrent networks: Process variable-length sequences
  • Transformers: Use attention mechanisms for long-range dependencies

The training framework remains identical—only the neural network architecture changes. Understanding this progression from simple bigrams to transformers reveals how modern language models work under the hood.