/ module 03

Pre-training

Pre-training is the stage where a model learns the statistics of language from raw text. The objective is simple: given the tokens so far, predict the next one. Repeat this over trillions of tokens, most of them drawn from the public internet.

objective

Next-token prediction

Virtually every modern LLM is pre-trained on one task: given a context, output a probability distribution over the next token. The loss measures how surprised the model was by the actual next token: the negative log of the probability it assigned to that token (cross-entropy).
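To make "surprise" concrete, here is a minimal sketch of the cross-entropy loss for a single prediction. The three-word distribution is a made-up example, not output from a real model, and `cross_entropy` is a hypothetical helper name.

```python
import math

def cross_entropy(probs, actual):
    """Loss for one prediction: -log of the probability given to the actual next token."""
    return -math.log(probs[actual])

# Hypothetical predicted distribution over the next token (sums to 1).
predicted = {"cat": 0.5, "dog": 0.25, "mat": 0.25}

loss_likely = cross_entropy(predicted, "cat")    # the model expected this token
loss_unlikely = cross_entropy(predicted, "mat")  # the model was more surprised

print(f"actual='cat': loss = {loss_likely:.3f}")
print(f"actual='mat': loss = {loss_unlikely:.3f}")
```

The less probability the model put on the token that actually came next, the higher the loss. Training averages this number over every position in the corpus and nudges the parameters to bring it down.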

Real models use deep transformers and learn long-range patterns. The toy below uses a bigram, which only remembers the last word, but the mechanic is the same.
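For a sense of how little machinery a bigram needs, here is a rough sketch that "trains" one by counting word pairs and normalizing. The corpus string and the `train_bigram` name are assumptions for illustration; the lab may use different text and a different implementation.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count word pairs, then normalize counts into P(next | previous)."""
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    model = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        model[prev] = {w: c / total for w, c in nexts.items()}
    return model

# Hypothetical corpus, not necessarily the one used by the lab.
corpus = "the cat sat on the mat the dog sat on the log"
model = train_bigram(corpus)

print(model["the"])  # distribution over words seen after "the"
print(model["sat"])  # distribution over words seen after "sat"
```

Counting is the closed-form answer for a bigram. A transformer cannot be solved by counting, so it reaches its parameters through gradient descent instead, but both end up producing a distribution over the next token.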

loss

How wrong predictions are. Lower = better.

temperature

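Controls sampling randomness. 0 = greedy (always pick the most likely token), 1 = sample from the model's distribution unchanged, >1 = flatter distribution and wilder output.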

parameters

The learned numbers (weights) inside the model.
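Temperature is easiest to understand by watching it reshape a distribution. The sketch below assumes one common formulation: divide the log-probabilities by the temperature and renormalize, treating 0 as greedy. The probabilities are illustrative, and `apply_temperature` and `sample` are hypothetical helper names.

```python
import math
import random

def apply_temperature(probs, temperature):
    """Rescale a distribution by dividing log-probabilities by T, then renormalize."""
    if temperature <= 0:
        # Treat 0 as greedy: all probability mass on the most likely token.
        best = max(probs, key=probs.get)
        return {w: (1.0 if w == best else 0.0) for w in probs}
    logs = {w: math.log(p) / temperature for w, p in probs.items() if p > 0}
    top = max(logs.values())  # subtract the max for numerical stability
    scaled = {w: math.exp(v - top) for w, v in logs.items()}
    total = sum(scaled.values())
    return {w: s / total for w, s in scaled.items()}

def sample(probs, temperature, rng=random):
    """Draw one token from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    words = list(adjusted)
    return rng.choices(words, weights=[adjusted[w] for w in words], k=1)[0]

# Illustrative next-word distribution (an assumption, not model output).
probs = {"cat": 0.3, "mat": 0.2, "dog": 0.2, "log": 0.1, "lazy": 0.1, "quick": 0.1}

for t in (0, 0.5, 1.0, 2.0):
    adjusted = apply_temperature(probs, t)
    print(t, {w: round(p, 3) for w, p in adjusted.items()})
```

Low temperatures sharpen the distribution toward the top choice, and high temperatures flatten it so that unlikely words get picked more often.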

Live lab · Train a bigram model

Training corpus

P(next word | seed)

cat: 30.0%
mat: 20.0%
dog: 20.0%
log: 10.0%
lazy: 10.0%
quick: 10.0%

Generate

Steps: 12 · Temperature: 0.80


A real LLM does this with billions of parameters, attention over thousands of tokens, and a vocabulary of ~100k. The principle stays identical: predict the next token.
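The generation loop behind a lab like this can be sketched in a few lines: look up the distribution for the last word, sample from it, append, repeat. The `BIGRAM` table and the `generate` function are hypothetical stand-ins, and the defaults simply mirror the lab's settings of 12 steps at temperature 0.80.

```python
import random

# Hypothetical bigram table: P(next | previous). Not taken from the lab.
BIGRAM = {
    "the": {"cat": 0.5, "dog": 0.3, "mat": 0.2},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"the": 1.0},
}

def generate(seed, steps=12, temperature=0.8, rng=None):
    """Autoregressive sampling: each new word depends only on the previous one."""
    rng = rng or random.Random()
    words = [seed]
    for _ in range(steps):
        dist = BIGRAM.get(words[-1])
        if not dist:  # no known continuation: stop early
            break
        candidates = list(dist)
        if temperature <= 0:
            nxt = max(candidates, key=dist.get)  # greedy
        else:
            # p ** (1/T) is the same as dividing log-probabilities by T.
            weights = [dist[w] ** (1.0 / temperature) for w in candidates]
            nxt = rng.choices(candidates, weights=weights, k=1)[0]
        words.append(nxt)
    return " ".join(words)

print(generate("the", rng=random.Random(0)))
print(generate("the", steps=4, temperature=0))
```

An LLM runs the same loop. The difference is that its distribution comes from a forward pass over the whole context rather than a table lookup on the last word.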

Live lab · Animated training run · A100 GPU

What does pre-training feel like? Below is a simulated job on an NVIDIA A100. Watch loss drop epoch by epoch, GPU utilization spike, VRAM fill up, and tokens stream through, all without a real GPU.

Model:

Epochs: 6 · Steps/epoch: 40 · Animation speed: 60 ms/step

NVIDIA A100 · 80GB SXM

idle

util 0% · vram 0/80 GB · temp 34°C · power 35 W

GPT-mini (124M) · 124M params

Epoch 1/6 · step 0/240 · 0.0%

tokens seen

throughput

85.0k tok/s

elapsed

0.0s

eta

n/a


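Real pre-training of a frontier model takes weeks to months across thousands of GPUs, and it typically makes roughly one pass over the data rather than several epochs, so progress is logged per step. The shape of the loss curve is much the same as here, only zoomed out: a steep early drop followed by a long, slow decline.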
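The falling loss can be reproduced at toy scale without any GPU. The sketch below trains a bigram's logits with plain stochastic gradient descent on cross-entropy and prints the average loss per epoch. The corpus, the learning rate, and the `train` function are assumptions chosen for illustration; this is not the code behind the animation.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities (max subtracted for stability)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(text, epochs=6, lr=0.5):
    words = text.split()
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    pairs = [(index[a], index[b]) for a, b in zip(words, words[1:])]
    # Parameters: one row of next-word logits per previous word.
    weights = [[0.0] * len(vocab) for _ in vocab]
    epoch_losses = []
    for epoch in range(1, epochs + 1):
        total = 0.0
        for prev, nxt in pairs:             # one "step" per training pair
            probs = softmax(weights[prev])
            total += -math.log(probs[nxt])  # cross-entropy for this pair
            # Gradient of cross-entropy w.r.t. the logits is (probs - one_hot).
            for j, p in enumerate(probs):
                grad = p - (1.0 if j == nxt else 0.0)
                weights[prev][j] -= lr * grad
        epoch_losses.append(total / len(pairs))
        print(f"epoch {epoch}/{epochs}  loss {epoch_losses[-1]:.3f}")
    return epoch_losses

# Hypothetical corpus for illustration.
losses = train("the cat sat on the mat the dog sat on the log")
```

The loop has the same structure as a real run: forward pass, loss, gradient, parameter update. At scale the differences are the model (a transformer rather than a lookup table), the batch size, and the optimizer.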