Building Character-Level Language Models: From Bigrams to GPT-2
Step-by-step tutorial on building character-level language models, starting with simple bigram models and progressing to transformer architectures.
Signal Editorial Team
Character-level language models predict the next character in a sequence by learning patterns from training data. This tutorial builds these models step-by-step, starting with simple bigram counting and progressing to neural network implementations that form the foundation of modern transformers.
Understanding Character-Level Language Models
A character-level language model treats text as sequences of individual characters. For the name “REESE”, the model sees the sequence R-E-E-S-E and learns to predict each next character given the previous context.
These models excel at generating text that follows learned patterns. Train one on a dataset of names, and it generates new, name-like sequences that sound plausible but don’t exist in the original data.
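For a concrete illustration, a single name such as "reese" yields the following character pairs once a '.' marker is added for the start and end of the word (a small sketch; the full pipeline appears below):
# Training pairs extracted from the single name "reese"
word = "reese"
chars = ['.'] + list(word) + ['.']   # '.' marks start and end
pairs = list(zip(chars, chars[1:]))
print(pairs)
# [('.', 'r'), ('r', 'e'), ('e', 'e'), ('e', 's'), ('s', 'e'), ('e', '.')]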
Building a Bigram Model
The simplest character-level model is a bigram model, which predicts the next character using only the immediately preceding character.
Preparing the Data
Start with a dataset of names and extract all character pairs (bigrams):
import torch
import matplotlib.pyplot as plt

# Load names dataset
words = open('names.txt', 'r').read().splitlines()

# Create bigrams with special start/end tokens
bigrams = []
for word in words:
    chars = ['.'] + list(word) + ['.']  # . marks start/end
    for ch1, ch2 in zip(chars, chars[1:]):
        bigrams.append((ch1, ch2))
Counting and Normalizing
Count bigram frequencies and convert to probabilities:
# Count bigram occurrences
counts = {}
for ch1, ch2 in bigrams:
    counts[(ch1, ch2)] = counts.get((ch1, ch2), 0) + 1

# Convert to 2D array for efficiency
chars = sorted(list(set(''.join(words))))
chars = ['.'] + chars  # Start token first

# Create lookup tables
s2i = {s: i for i, s in enumerate(chars)}
i2s = {i: s for s, i in s2i.items()}

# Build count matrix
N = torch.zeros((27, 27), dtype=torch.int32)
for ch1, ch2 in bigrams:
    ix1, ix2 = s2i[ch1], s2i[ch2]
    N[ix1, ix2] += 1

# Convert to probabilities
P = N.float()
P = P / P.sum(1, keepdim=True)  # Normalize rows
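The matplotlib import above comes in handy for inspecting this matrix; one way to visualize the counts (an illustrative sketch, not required for the model itself) is:
# Visualize the 27x27 bigram count matrix with character-pair labels
plt.figure(figsize=(16, 16))
plt.imshow(N, cmap='Blues')
for i in range(27):
    for j in range(27):
        plt.text(j, i, i2s[i] + i2s[j], ha='center', va='bottom', color='gray')
        plt.text(j, i, str(N[i, j].item()), ha='center', va='top', color='gray')
plt.axis('off')
plt.show()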
Sampling from the Model
Generate new names by following the probability distributions:
def sample_bigram(P, i2s, seed=42):
    g = torch.Generator().manual_seed(seed)
    name = []
    ix = 0  # Start with '.'
    while True:
        p = P[ix]
        ix = torch.multinomial(p, 1, replacement=True, generator=g).item()
        if ix == 0:  # End token
            break
        name.append(i2s[ix])
    return ''.join(name)

# Generate samples (vary the seed so each call produces a different name)
for i in range(5):
    print(sample_bigram(P, i2s, seed=42 + i))
Neural Network Implementation
The counting approach works for bigrams but doesn't scale to longer contexts: conditioning on the previous two characters already requires 27 × 27 = 729 context rows, and the table grows exponentially with context length. Neural networks provide a flexible alternative that matches the counting results for bigrams while extending naturally to longer contexts.
Creating Training Data
Convert bigrams to input-output pairs for neural network training:
# Prepare training data
xs, ys = [], []
for ch1, ch2 in bigrams:
    xs.append(s2i[ch1])
    ys.append(s2i[ch2])

xs = torch.tensor(xs)
ys = torch.tensor(ys)
One-Hot Encoding
Neural networks need vector inputs, not integers:
import torch.nn.functional as F

# Convert integers to one-hot vectors
x_enc = F.one_hot(xs, num_classes=27).float()
print(f"Shape: {x_enc.shape}")  # [num_examples, 27]
Building the Neural Network
Create a single linear layer that maps 27 inputs to 27 outputs:
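A minimal sketch of one way to build and train such a layer, reusing xs, ys, and x_enc from the previous steps; the seed, the 100 steps, and the learning rate of 50 are illustrative choices rather than tuned values:
# Single linear layer: 27 one-hot inputs -> 27 output logits
g = torch.Generator().manual_seed(42)
W = torch.randn((27, 27), generator=g, requires_grad=True)

for step in range(100):
    # Forward pass
    logits = x_enc @ W                             # Log-counts
    counts = logits.exp()                          # Exponentiate: log-counts -> counts
    probs = counts / counts.sum(1, keepdim=True)   # Normalize: counts -> probabilities
    loss = -probs[torch.arange(len(ys)), ys].log().mean()  # Negative log-likelihood

    # Backward pass and parameter update
    W.grad = None
    loss.backward()
    W.data += -50 * W.grad  # Gradient descent step (illustrative learning rate)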
Exponentiating the logits converts log-counts into counts.
Normalizing those counts produces probability distributions over the next character.
The weight matrix W learns to store the same log-counts that the counting method computed directly.
Model Evaluation
Measure model quality using negative log-likelihood on the training set:
def evaluate_loss(probs, ys):
    """Calculate average negative log-likelihood"""
    log_probs = probs[torch.arange(len(ys)), ys].log()
    return -log_probs.mean()

loss = evaluate_loss(probs, ys)
print(f"Training loss: {loss.item():.4f}")
Lower loss indicates better model performance. A perfect model, one that assigns probability 1 to every correct next character, would achieve a loss of 0.
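The same function can also score the counting-based model by indexing its probability table P with the training inputs; the neural network's loss should approach this value as training converges (a comparison sketch using the P, xs, and ys defined earlier):
# Score the counting-based bigram table with the same metric.
# P[xs] picks out, for each example, the probability row of its input character.
counting_loss = evaluate_loss(P[xs], ys)
print(f"Counting-model loss: {counting_loss.item():.4f}")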
Model Smoothing and Regularization
Prevent zero probabilities by adding small counts (smoothing) or regularization:
# Smoothing: add fake counts
N_smooth = N + 1  # Add 1 to all counts
P_smooth = N_smooth.float() / N_smooth.sum(1, keepdim=True)

# Regularization: encourage small weights
def regularized_loss(probs, ys, W, reg_strength=0.01):
    nll = -probs[torch.arange(len(ys)), ys].log().mean()
    reg = reg_strength * (W**2).mean()
    return nll + reg
Sampling from the Neural Network
Generate text using the trained neural network:
def sample_neural(W, i2s, seed=42):
    g = torch.Generator().manual_seed(seed)
    name = []
    ix = 0  # Start token
    while True:
        # Convert to one-hot and get probabilities
        x_enc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = x_enc @ W
        probs = F.softmax(logits, dim=1)

        # Sample next character
        ix = torch.multinomial(probs, 1, generator=g).item()
        if ix == 0:  # End token
            break
        name.append(i2s[ix])
    return ''.join(name)
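As with the counting model, a handful of names can be drawn by varying the seed (a usage sketch assuming the W trained above):
# Generate a few names from the trained network
for i in range(5):
    print(sample_neural(W, i2s, seed=42 + i))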
Key Insights
Scalability: The neural network approach extends naturally to longer contexts by changing the input representation, while counting becomes impractical.
Flexibility: Neural networks can incorporate complex architectures (multiple layers, attention mechanisms) while maintaining the same training framework.
Foundation: This simple model demonstrates the core concepts used in modern language models like GPT-2: convert text to vectors, process through neural networks, output probability distributions, and optimize with gradient descent.
Next Steps
This bigram model forms the foundation for more sophisticated architectures:
Multi-layer perceptrons: Add hidden layers for more complex patterns
Recurrent networks: Process variable-length sequences
Transformers: Use attention mechanisms for long-range dependencies
The training framework remains identical—only the neural network architecture changes. Understanding this progression from simple bigrams to transformers reveals how modern language models work under the hood.
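As a rough illustration of that point, the sketch below conditions on three previous characters with a small multi-layer perceptron; block_size, emb_dim, and hidden are illustrative names and sizes, not part of this tutorial's code. The output is still a vector of 27 logits, so the same negative log-likelihood loss and gradient-descent loop apply unchanged.
# Sketch: a 3-character-context MLP in the same framework (illustrative sizes)
block_size = 3   # Number of previous characters used as context
emb_dim = 10     # Embedding size per character
hidden = 200     # Hidden-layer width

g = torch.Generator().manual_seed(42)
C = torch.randn((27, emb_dim), generator=g, requires_grad=True)   # Character embeddings
W1 = torch.randn((block_size * emb_dim, hidden), generator=g, requires_grad=True)
b1 = torch.randn(hidden, generator=g, requires_grad=True)
W2 = torch.randn((hidden, 27), generator=g, requires_grad=True)
b2 = torch.randn(27, generator=g, requires_grad=True)

def mlp_forward(context_ix):
    """context_ix: (batch, block_size) tensor of character indices."""
    emb = C[context_ix]                               # (batch, block_size, emb_dim)
    h = torch.tanh(emb.view(emb.shape[0], -1) @ W1 + b1)
    logits = h @ W2 + b2                              # (batch, 27), same output shape as before
    return logits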