/ module 02

Vocabulary

A model's vocabulary is the fixed set of tokens it knows. Choosing its size is a balancing act between sequence length, memory, and coverage: a bigger vocabulary gives shorter sequences but a larger embedding table.
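To put rough numbers on the memory side of that trade-off, here is a small sketch. It assumes the embedding table is a plain vocab_size × d_model matrix; the hidden size of 4096 and the 16-bit weights are illustrative assumptions, not figures for any particular model.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of weights in a vocab_size x d_model embedding table."""
    return vocab_size * d_model


D_MODEL = 4096       # assumed hidden size, chosen only for illustration
BYTES_PER_PARAM = 2  # assumed 16-bit weights

for vocab_size in (32_000, 100_000, 128_000):
    params = embedding_params(vocab_size, D_MODEL)
    mib = params * BYTES_PER_PARAM / 2**20
    print(f"{vocab_size:>7} tokens -> {params:>11,} weights ({mib:,.0f} MiB)")
```

The table grows linearly with vocabulary size, so doubling the vocabulary doubles this part of the model regardless of how much shorter the sequences become.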

concept

How vocabularies are built

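The simplest approach is to count every word in a corpus and keep the N most frequent. Modern tokenizers instead learn a subword vocabulary, so rare words still decompose into known pieces: BPE and WordPiece build it up by merging frequent symbol pairs, while Unigram starts from a large candidate set and prunes it. GPT-4's tokenizer has roughly 100k tokens; Llama 3's has 128k.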
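The merge idea behind BPE can be sketched in a few lines. This is a toy version, assuming a word-frequency table as input and single characters as the starting symbols. The function name `learn_bpe_merges` and the sample counts are hypothetical, and production tokenizers add pre-tokenization, byte-level fallbacks, and faster data structures.

```python
from collections import Counter


def learn_bpe_merges(word_counts: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges pair merges from a word -> frequency table."""
    # Start with every word split into single characters.
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_words = {}
        for symbols, count in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_words[key] = new_words.get(key, 0) + count
        words = new_words
    return merges


# Hypothetical toy frequencies, not taken from any real corpus.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe_merges(toy_counts, num_merges=4))
```

Each learned merge adds one new token to the vocabulary, so the number of merges is effectively the knob that sets the final vocabulary size.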

vocab size

Count of unique tokens the model recognizes.

OOV

Out-of-vocabulary: a word that is not in the vocabulary, so the tokenizer cannot represent it directly.

coverage

Percentage of corpus words that the vocabulary can represent.
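The three terms above fit together in a short sketch of the word-level, top-N approach. It assumes coverage is measured over word occurrences rather than over distinct words, which is one reasonable reading of the definition; the helper names and the toy corpus are hypothetical.

```python
from collections import Counter


def build_vocab(words: list[str], vocab_size: int) -> set[str]:
    """Keep the vocab_size most frequent words."""
    return {word for word, _ in Counter(words).most_common(vocab_size)}


def coverage(words: list[str], vocab: set[str]) -> float:
    """Percentage of word occurrences in `words` that appear in `vocab`."""
    if not words:
        return 0.0
    known = sum(1 for word in words if word in vocab)
    return 100.0 * known / len(words)


# Hypothetical toy corpus, already lowercased and stripped of punctuation.
corpus = "the cat sat on the mat the dog sat on the log".split()

vocab = build_vocab(corpus, vocab_size=3)
oov = sorted(set(corpus) - vocab)  # words the vocab cannot represent

print("vocab size:", len(vocab))
print("OOV words:", oov)
print(f"coverage: {coverage(corpus, vocab):.1f}%")
```

A subword tokenizer would avoid most of these OOV words by splitting them into smaller known pieces instead of dropping them.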

Live lab · Vocabulary Builder

Corpus

Vocab size: 40

top 38 tokens

Rank  Token       Count
#1    the         6
#2    a           3
#3    sat         2
#4    on          2
#5    dog         2
#6    of          2
#7    tokens      2
#8    cat         1
#9    mat         1
#10   log         1
#11   quick       1
#12   brown       1
#13   fox         1
#14   jumps       1
#15   over        1
#16   lazy        1
#17   language    1
#18   models      1
#19   learn       1
#20   from        1
#21   large       1
#22   amounts     1
#23   text        1
#24   vocabulary  1
#25   is          1
#26   finite      1
#27   set         1
#28   that        1
#29   model       1
#30   recognizes  1
#31   rare        1
#32   words       1
#33   may         1
#34   be          1
#35   split       1
#36   into        1
#37   multiple    1
#38   subword     1

Unique words

38

Total words

50

Coverage

100.0%

Shrink the vocab to feel the trade-off: rare words drop out first.
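The same experiment can be approximated offline. This sketch reuses the counts from the lab's token list and assumes coverage is the share of word occurrences that fall inside the top-N vocabulary; the lab may compute it differently, and `coverage_at` is a hypothetical helper name.

```python
# Counts copied from the lab's token list, highest rank first:
# "the" = 6, "a" = 3, five words seen twice, and 31 words seen once.
counts = [6, 3] + [2] * 5 + [1] * 31


def coverage_at(counts: list[int], vocab_size: int) -> float:
    """Percentage of word occurrences covered by the top vocab_size words."""
    total = sum(counts)
    if total == 0:
        return 0.0
    kept = sum(sorted(counts, reverse=True)[:vocab_size])
    return 100.0 * kept / total


for vocab_size in (40, 38, 20, 7, 2):
    print(f"vocab size {vocab_size:>2}: {coverage_at(counts, vocab_size):5.1f}% coverage")
```

Because the words that drop out first are the ones seen only once, coverage falls slowly at first and then much faster once the cut reaches the frequent words.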