/ module 03
Pre-training
Pre-training is when a model learns the statistics of language from raw text. The objective is simple: given the tokens so far, predict the next one. Repeat this over trillions of tokens of text, much of it collected from the web.
objective
Next-token prediction
Nearly every modern LLM is pre-trained on one task: given a context, output a probability distribution over the next token. The loss measures how surprised the model was by the actual next token: the negative log of the probability it assigned to that token (cross-entropy).
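To make the loss concrete, here is a minimal sketch of cross-entropy for a single prediction. The context, the three candidate tokens, and their probabilities are invented for illustration; a real model would produce a distribution over its whole vocabulary.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability (in nats) the model assigned to the actual next token."""
    return -math.log(probs[target])

# Hypothetical predicted distribution after the context "the cat sat on the".
predicted = {"mat": 0.7, "sofa": 0.2, "moon": 0.1}

loss_expected = cross_entropy(predicted, "mat")     # model was fairly confident
loss_surprising = cross_entropy(predicted, "moon")  # model thought this unlikely

print(f"next token 'mat':  loss {loss_expected:.3f}")
print(f"next token 'moon': loss {loss_surprising:.3f}")
```

The likely token costs about 0.36 nats and the unlikely one about 2.30: the more surprised the model is, the higher the loss.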
Real models use deep transformers and learn long-range patterns. The toy below uses a bigram, which only remembers the last word, but the mechanic is the same.
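As a rough sketch of what a bigram model looks like in code, the example below counts which word follows which and turns the counts into probabilities. The corpus and the function names `train_bigram` and `next_word_distribution` are made up for this illustration and are not taken from the lesson's own toy.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each word follows each previous word."""
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_distribution(counts, prev):
    """Normalize the counts for one previous word into probabilities."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

# Tiny illustrative corpus (an assumption, not real training data).
corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigram(corpus)
dist = next_word_distribution(model, "the")
print(dist)
```

In this corpus "the" is followed by "cat" twice and by "mat" and "fish" once each, so the model predicts "cat" with probability 0.5 and the other two with 0.25 each. It has no way to use anything earlier than the previous word.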
A real LLM does this with billions of parameters, attention over thousands of tokens, and a vocabulary of roughly 100k tokens. The principle stays identical: predict the next token.
What does pre-training feel like? Below is a simulated job on an NVIDIA A100. Watch loss drop epoch by epoch, GPU utilization spike, VRAM fill up, and tokens stream through, all without a real GPU.
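Real pre-training of a frontier model takes weeks to months across thousands of GPUs, and it usually makes only about one pass over the data, so progress is logged per step rather than per epoch. The shape of the loss curve and the rhythm of the logs look just like this, only zoomed out.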
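The falling loss curve can be reproduced at toy scale on a CPU. The sketch below trains a bigram model with a softmax over next-token scores, using plain gradient descent on the cross-entropy loss and printing one log line per epoch. The corpus, the learning rate of 0.5, and the 20 epochs are arbitrary choices for illustration, and nothing here models GPU utilization or VRAM.

```python
import math

# Tiny illustrative corpus (an assumption, not real training data).
corpus = "the cat sat on the mat the cat ate the fish".split()
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}
pairs = [(index[a], index[b]) for a, b in zip(corpus, corpus[1:])]

V = len(vocab)
# logits[i][j] is the unnormalized score for token j following token i.
logits = [[0.0] * V for _ in range(V)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def run_epoch(lr):
    """One pass over all (previous, next) pairs; returns the mean loss."""
    total = 0.0
    for prev, nxt in pairs:
        probs = softmax(logits[prev])
        total += -math.log(probs[nxt])
        # Gradient of cross-entropy w.r.t. the logits is probs - one_hot(nxt).
        for j in range(V):
            grad = probs[j] - (1.0 if j == nxt else 0.0)
            logits[prev][j] -= lr * grad
    return total / len(pairs)

losses = []
for epoch in range(1, 21):
    loss = run_epoch(lr=0.5)
    losses.append(loss)
    print(f"epoch {epoch:2d}  loss {loss:.4f}")
```

The loss starts near ln(7), the cost of guessing uniformly over the seven words, and flattens out above zero because some contexts are genuinely ambiguous: "the" is followed by three different words, so no model can predict it perfectly.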