/ module 02
Vocabulary
A model's vocabulary is the fixed set of tokens it knows. Building it well is a balancing act between sequence length, memory and coverage. A bigger vocab gives shorter sequences but a larger embedding table.
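To put a rough number on the memory side of that trade-off: the embedding table stores one vector per token, so it holds vocab size × hidden size parameters. The sketch below assumes a hypothetical hidden size of 4096 and 2 bytes per parameter (16-bit weights). Neither figure comes from this module, and the function names are made up for illustration.

```python
def embedding_table_params(vocab_size: int, d_model: int) -> int:
    """Parameters in an embedding table: one d_model-sized vector per token."""
    return vocab_size * d_model


def embedding_table_bytes(vocab_size: int, d_model: int, bytes_per_param: int = 2) -> int:
    """Memory footprint, assuming a fixed number of bytes per parameter."""
    return embedding_table_params(vocab_size, d_model) * bytes_per_param


# Hypothetical hidden size, chosen only for illustration.
D_MODEL = 4096

for vocab_size in (32_000, 100_000, 128_000):
    params = embedding_table_params(vocab_size, D_MODEL)
    mib = embedding_table_bytes(vocab_size, D_MODEL) / 2**20
    print(f"{vocab_size:>7} tokens -> {params:,} params, {mib:.0f} MiB")
```

The table grows linearly with vocab size, so under these assumptions quadrupling the vocab quadruples this part of the model.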
concept
How vocabularies are built
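The simplest approach: count every word in a corpus and keep the top N most frequent. Modern tokenizers (BPE, WordPiece, Unigram) instead learn a set of subword units from the corpus, so rare words still decompose into known pieces. BPE and WordPiece build that set by repeatedly merging frequent pairs; Unigram starts from a large candidate set and prunes it. GPT-4 uses ~100k tokens; Llama 3 uses 128k.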
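The word-counting approach can be sketched in a few lines. This is a minimal illustration, not how any production tokenizer works: `build_vocab` is a hypothetical helper, the sample sentence is made up, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter


def build_vocab(corpus: str, max_size: int) -> list[str]:
    """Return the max_size most frequent words, most frequent first."""
    # Naive tokenization: lowercase and split on whitespace (an assumption;
    # real pipelines also handle punctuation, casing rules, and so on).
    words = corpus.lower().split()
    counts = Counter(words)
    # most_common sorts by descending count.
    return [word for word, _ in counts.most_common(max_size)]


corpus = "the cat sat on the mat the dog sat on the log"
vocab = build_vocab(corpus, max_size=3)
print(vocab)
```

With `max_size=3`, only "the", "sat" and "on" survive; every word that appears once falls outside the vocab.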
vocab size
Count of unique tokens the model recognizes.
OOV
Out-of-vocabulary, meaning a word the tokenizer can't represent.
coverage
Percentage of corpus words the vocab can represent.
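The three terms fit together in a short sketch. It assumes a word-level vocab, counts coverage per word occurrence (one possible reading of the definition above), and uses `<unk>` as a placeholder for OOV words. The helper names and the placeholder string are illustrative choices, not taken from the lab.

```python
def coverage(words: list[str], vocab: set[str]) -> float:
    """Percentage of word occurrences in `words` that are in `vocab`."""
    if not words:
        return 0.0
    known = sum(1 for w in words if w in vocab)
    return 100.0 * known / len(words)


def encode(words: list[str], vocab: set[str], unk: str = "<unk>") -> list[str]:
    """Replace every out-of-vocabulary word with a placeholder token."""
    return [w if w in vocab else unk for w in words]


words = "the cat sat on the mat".split()
vocab = {"the", "sat", "on"}  # vocab size = 3

print(encode(words, vocab))
print(f"{coverage(words, vocab):.1f}%")
```

Here "cat" and "mat" are OOV, so four of the six words are covered.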
Live lab · Vocabulary Builder
top 38 tokens (rank, token, count)
- #1 the (6)
- #2 a (3)
- #3 sat (2)
- #4 on (2)
- #5 dog (2)
- #6 of (2)
- #7 tokens (2)
- #8 cat (1)
- #9 mat (1)
- #10 log (1)
- #11 quick (1)
- #12 brown (1)
- #13 fox (1)
- #14 jumps (1)
- #15 over (1)
- #16 lazy (1)
- #17 language (1)
- #18 models (1)
- #19 learn (1)
- #20 from (1)
- #21 large (1)
- #22 amounts (1)
- #23 text (1)
- #24 vocabulary (1)
- #25 is (1)
- #26 finite (1)
- #27 set (1)
- #28 that (1)
- #29 model (1)
- #30 recognizes (1)
- #31 rare (1)
- #32 words (1)
- #33 may (1)
- #34 be (1)
- #35 split (1)
- #36 into (1)
- #37 multiple (1)
- #38 subword (1)
Unique words: 38
Total words: 50
Coverage: 100.0%
Shrink the vocab to feel the trade-off: rare words drop out first.
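The same experiment can be approximated offline. The sketch below copies the word counts from the lab's token list and assumes coverage is measured per word occurrence, which may differ from how the lab computes it. `coverage_at` is a hypothetical helper.

```python
# Word counts copied from the lab's token list.
counts = {"the": 6, "a": 3, "sat": 2, "on": 2, "dog": 2, "of": 2, "tokens": 2}
singletons = (
    "cat mat log quick brown fox jumps over lazy language models learn from "
    "large amounts text vocabulary is finite set that model recognizes rare "
    "words may be split into multiple subword"
).split()
counts.update({word: 1 for word in singletons})

total = sum(counts.values())
# Counts sorted from most to least frequent, so truncating drops rare words first.
ranked = sorted(counts.values(), reverse=True)


def coverage_at(vocab_size: int) -> float:
    """Percentage of word occurrences covered by the top vocab_size words."""
    return 100.0 * sum(ranked[:vocab_size]) / total


for vocab_size in (38, 30, 20, 10, 7, 1):
    print(f"vocab size {vocab_size:>2}: coverage {coverage_at(vocab_size):.1f}%")
```

Under this assumption, each word dropped from the tail costs only one occurrence out of 50, so coverage falls slowly at first; the seven most frequent words alone still account for 19 of the 50 occurrences.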