/ module 03

Pre-training

Pre-training is the stage in which a model learns the statistics of language from raw text. The objective is simple: given the tokens so far, predict the next one. Repeat this over trillions of tokens, most of them scraped from the internet.

objective

Next-token prediction

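Nearly every modern LLM is pre-trained on one task: given a context, output a probability distribution over the next token. The loss measures how surprised the model was by the actual next token. Formally it is the cross-entropy: the negative log of the probability the model assigned to that token.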
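To make "surprise" concrete, here is a minimal sketch of the cross-entropy loss for a single prediction. The distribution and the helper name `cross_entropy` are made up for illustration; a real model produces a distribution over its whole vocabulary.

```python
import math

def cross_entropy(probs, actual_token):
    """Loss for one prediction: -log of the probability given to the actual next token."""
    return -math.log(probs[actual_token])

# Hypothetical predicted distribution over the next token (illustrative numbers only).
predicted = {"cat": 0.5, "dog": 0.3, "mat": 0.2}

loss_likely = cross_entropy(predicted, "cat")    # the model expected this token
loss_unlikely = cross_entropy(predicted, "mat")  # the model was more surprised

print(f"loss if next token is 'cat': {loss_likely:.3f}")   # 0.693
print(f"loss if next token is 'mat': {loss_unlikely:.3f}")  # 1.609
```

The less probability the model put on the token that actually came next, the higher the loss. Training averages this quantity over every position in the training text.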

Real models use deep transformers and learn long-range patterns. The toy below uses a bigram model, which conditions only on the previous word, but the mechanic is the same.
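A bigram model can be built by counting, which makes the mechanic easy to see. The sketch below is one simple way to do it; the corpus and the function names `train_bigram` and `next_word_probs` are assumptions for illustration, not the lab's actual code or data.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each word follows each other word."""
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Turn the counts for one context word into P(next word | prev)."""
    total = sum(counts[prev].values())
    return {word: n / total for word, n in counts[prev].items()}

# Hypothetical toy corpus, not the one used by the lab.
corpus = "the cat sat on the mat the dog sat on the log"
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(probs)  # each of cat, mat, dog, log gets 0.25
```

In this corpus "the" is followed once each by four different words, so each gets probability 0.25, while "sat" is always followed by "on". Counting works here because the context is a single word; a transformer has far too many possible contexts to count, so it has to learn the distribution by gradient descent.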

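loss: How wrong the predictions are. Lower is better.
temperature: Sampling randomness. 0 = greedy (always pick the most likely token); values above 1 flatten the distribution and make the output more random.
parameters: The learned numbers (weights) inside the model.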
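Temperature is the least obvious of these terms, so here is a small sketch of how it is commonly applied: divide the model's raw scores (logits) by the temperature before the softmax. The logits and function names are hypothetical, chosen only for illustration.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; temperature must be > 0."""
    scaled = [x / temperature for x in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(logits, exps)}

def sample(logits, temperature, rng):
    """Greedy pick at temperature 0, otherwise sample from the rescaled distribution."""
    if temperature == 0:
        return max(logits, key=logits.get)
    probs = apply_temperature(logits, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Hypothetical logits (unnormalized scores), chosen only for illustration.
logits = {"cat": 2.0, "dog": 1.0, "mat": 0.5}
rng = random.Random(0)

sharp = apply_temperature(logits, 0.5)  # low temperature: mass piles onto "cat"
flat = apply_temperature(logits, 2.0)   # high temperature: closer to uniform
print("T=0.5:", sharp)
print("T=2.0:", flat)
print("greedy:", sample(logits, 0, rng))
print("sampled at T=1:", sample(logits, 1.0, rng))
```

Temperature 0 is handled as a special case because dividing by zero is undefined; in the limit it is the same as always taking the highest-scoring token.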
Live lab · Train a bigram model
P(next word | seed)
cat    30.0%
mat    20.0%
dog    20.0%
log    10.0%
lazy   10.0%
quick  10.0%

A real LLM does this with billions of parameters, attention over thousands of tokens, and a vocabulary of roughly 100k tokens. The principle stays identical: predict the next token.

Live lab · Animated training run · A100 GPU

What does pre-training feel like? Below is a simulated job on an NVIDIA A100. Watch loss drop epoch by epoch, GPU utilization spike, VRAM fill up, and tokens stream through, all without a real GPU.
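The same epoch-by-epoch loss drop can be reproduced on a CPU with a tiny model. The sketch below trains bigram logits by stochastic gradient descent on cross-entropy. The corpus, learning rate, and epoch count are assumptions for this toy; it is not the simulation behind the lab.

```python
import math

corpus = "the cat sat on the mat the dog sat on the log".split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
pairs = [(index[a], index[b]) for a, b in zip(corpus, corpus[1:])]

# Parameters: one row of logits per context word, all starting at zero.
logits = [[0.0] * len(vocab) for _ in vocab]
learning_rate = 0.5  # assumed value for this toy

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

epoch_losses = []
for epoch in range(6):
    total_loss = 0.0
    for prev, nxt in pairs:
        probs = softmax(logits[prev])
        total_loss += -math.log(probs[nxt])
        # Gradient of cross-entropy w.r.t. the logits is (probs - one_hot(target)).
        for j, p in enumerate(probs):
            grad = p - (1.0 if j == nxt else 0.0)
            logits[prev][j] -= learning_rate * grad
    epoch_losses.append(total_loss / len(pairs))
    print(f"epoch {epoch + 1}/6  loss {epoch_losses[-1]:.3f}")
```

The average loss should end well below where it started, but it cannot reach zero: "the" is followed by four different words in this corpus, so some surprise is unavoidable.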

Simulated setup: NVIDIA A100 · 80GB SXM training GPT-mini (124M params) for 6 epochs.
Idle readings before the run starts: util 0% · vram 0/80 GB · temp 34°C · power 35W.
Panel readouts: tokens seen, throughput (85.0k tok/s), elapsed, eta, per-epoch progress (e1 to e6), and a streaming train.log.

Real pre-training of a frontier model takes weeks to months across thousands of GPUs. The loss curve has the same overall shape and the logs the same rhythm as the run above, only zoomed out.
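A rough back-of-the-envelope estimate shows where "weeks across thousands of GPUs" comes from. The sketch uses the common approximation that training costs about 6 FLOPs per parameter per token. Every input below is a hypothetical example, not a measurement of any real run, and the 40% utilization figure is an assumption.

```python
def training_days(params, tokens, num_gpus, peak_flops_per_gpu, utilization):
    """Estimate wall-clock days using the ~6 * params * tokens FLOPs rule of thumb."""
    total_flops = 6 * params * tokens
    effective_flops_per_second = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / effective_flops_per_second / 86_400  # seconds per day

# Hypothetical run: a 70B-parameter model on 2T tokens, 2048 A100s.
days = training_days(
    params=70e9,
    tokens=2e12,
    num_gpus=2048,
    peak_flops_per_gpu=312e12,  # A100 peak dense BF16 throughput
    utilization=0.4,            # assumed fraction of peak actually achieved
)
print(f"estimated training time: {days:.0f} days")  # about 38 days
```

By the same rule, the lab's 124M-parameter model at 85k tok/s would need roughly 6 × 124e6 × 85e3 ≈ 6.3e13 FLOP/s, which a single A100 can plausibly sustain.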