/ module 02

Vocabulary

A model's vocabulary is the fixed set of tokens it knows. Building it well is a balancing act between sequence length, memory, and coverage: a bigger vocabulary gives shorter sequences but a larger embedding table.
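To make the memory side of that trade-off concrete, here is a rough sketch of how an embedding table grows with vocabulary size. It assumes one vector per token, so the table holds vocab size times hidden size weights. The hidden size of 4096 and the three vocabulary sizes are illustrative values, not figures from this module.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of weights in a vocab_size x d_model embedding table."""
    return vocab_size * d_model


D_MODEL = 4096  # assumed hidden size, for illustration only

for vocab_size in (32_000, 100_000, 128_000):
    params = embedding_params(vocab_size, D_MODEL)
    print(f"{vocab_size:>7,} tokens -> {params / 1e6:,.0f}M embedding weights")
```

The table grows linearly with the vocabulary, which is the memory cost paid for shorter sequences.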

concept

How vocabularies are built

The simplest approach is to count every word in a corpus and keep the N most frequent. Modern tokenizers (BPE, WordPiece, Unigram) instead learn a vocabulary of subword units, so rare words still decompose into known pieces. GPT-4's vocabulary has roughly 100k tokens; Llama 3's has 128k.
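As an illustration of the subword idea, here is a minimal sketch of BPE-style merge learning on a toy word list. It is a teaching sketch, not how any production tokenizer is implemented: real tokenizers work on bytes or pre-tokenized text and handle word boundaries. The function names and the toy corpus are hypothetical.

```python
from collections import Counter


def apply_merge(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words: list, num_merges: int) -> list:
    """Learn up to num_merges merges, starting from single characters."""
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[apply_merge(symbols, best)] += freq
        corpus = new_corpus
    return merges


def segment(word: str, merges: list) -> list:
    """Split a word into subwords by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Toy corpus (hypothetical): "lowest" never appears in it.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(words, num_merges=10)
print(segment("lowest", merges))
```

With this toy corpus the unseen word "lowest" comes out as the two learned pieces "low" and "est", rather than falling out of the vocabulary.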

vocab size: the number of unique tokens the model recognizes.
OOV: out-of-vocabulary; a word the tokenizer can't represent.
coverage: the percentage of corpus words the vocab can represent.
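The three terms can be seen together in a small word-level sketch. It assumes a hypothetical `<unk>` placeholder that absorbs every out-of-vocabulary word, and it measures coverage per word occurrence; the glossary does not say whether coverage is counted per occurrence or per unique word, so that choice is an assumption.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder token for out-of-vocabulary words


def build_vocab(words: list, max_words: int) -> dict:
    """Map the max_words most frequent words to integer IDs; ID 0 is UNK."""
    vocab = {UNK: 0}
    for word, _count in Counter(words).most_common(max_words):
        vocab[word] = len(vocab)
    return vocab


def encode(words: list, vocab: dict) -> list:
    """Turn words into IDs, sending anything unknown to the UNK ID."""
    return [vocab.get(word, vocab[UNK]) for word in words]


corpus = "the cat sat on the mat the dog sat on the log".split()
vocab = build_vocab(corpus, max_words=4)

sentence = "the cat sat on the sofa".split()
ids = encode(sentence, vocab)
oov_words = [word for word in sentence if word not in vocab]
coverage = 100.0 * (len(sentence) - len(oov_words)) / len(sentence)

print("vocab size:", len(vocab))        # includes the UNK entry
print("ids:", ids)
print("OOV words:", oov_words)
print(f"coverage: {coverage:.1f}%")
```

Here "sofa" was never seen while building the vocab, so it is the only OOV word and maps to the UNK ID.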
Live lab · Vocabulary Builder
Top 38 tokens (rank, token, count)
  • #1 the (6)
  • #2 a (3)
  • #3 sat (2)
  • #4 on (2)
  • #5 dog (2)
  • #6 of (2)
  • #7 tokens (2)
  • #8 cat (1)
  • #9 mat (1)
  • #10 log (1)
  • #11 quick (1)
  • #12 brown (1)
  • #13 fox (1)
  • #14 jumps (1)
  • #15 over (1)
  • #16 lazy (1)
  • #17 language (1)
  • #18 models (1)
  • #19 learn (1)
  • #20 from (1)
  • #21 large (1)
  • #22 amounts (1)
  • #23 text (1)
  • #24 vocabulary (1)
  • #25 is (1)
  • #26 finite (1)
  • #27 set (1)
  • #28 that (1)
  • #29 model (1)
  • #30 recognizes (1)
  • #31 rare (1)
  • #32 words (1)
  • #33 may (1)
  • #34 be (1)
  • #35 split (1)
  • #36 into (1)
  • #37 multiple (1)
  • #38 subword (1)
Unique words: 38
Total words: 50
Coverage: 100.0%
Shrink the vocab to feel the trade-off: rare words drop out first.
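The shrinking experiment can be approximated offline with a short sketch. The lab's input text is not shown on this page, so the corpus below is an assumption, chosen only because it is consistent with the token counts the lab displays (38 unique words, 50 total). Coverage is measured per word occurrence, which is also an assumption.

```python
from collections import Counter

# Assumed corpus: consistent with the lab's displayed counts, not taken from it.
CORPUS = (
    "The cat sat on the mat. The dog sat on the log. "
    "The quick brown fox jumps over the lazy dog. "
    "Language models learn from large amounts of text. "
    "A vocabulary is a finite set of tokens that a model recognizes. "
    "Rare words may be split into multiple subword tokens."
)


def tokenize(text: str) -> list:
    """Lowercase, drop full stops, and split on whitespace."""
    return text.lower().replace(".", " ").split()


def coverage_at(words: list, vocab_size: int) -> float:
    """Percent of word occurrences covered by the vocab_size most frequent words."""
    kept = Counter(words).most_common(vocab_size)
    covered = sum(count for _word, count in kept)
    return 100.0 * covered / len(words)


words = tokenize(CORPUS)
print(f"unique words: {len(set(words))}, total words: {len(words)}")

for size in (38, 20, 7, 1):
    print(f"vocab size {size:>2}: coverage {coverage_at(words, size):.1f}%")
```

Because the kept words are the most frequent ones, each cut removes words seen only once before it touches common words like "the", so coverage falls slowly at first and then steeply.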