Tuesday, June 2, 2026

Tokenization in Plain English

 The core problem: Computers only understand numbers. Text is not numbers. Tokenization is the bridge.


What is a token?

A token is a "chunk" of text — sometimes a word, sometimes part of a word, sometimes punctuation.

Code
"I love chatbots" → ["I", " love", " chat", "bots"]

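Why "chatbots" gets split: the tokenizer's vocabulary is built from frequency statistics, and "chatbots" as a whole word is rarer than " chat" and "bots" individually. Those two pieces made it into the vocabulary and the full word did not. The exact split depends on the tokenizer.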
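To make the split concrete, here is a minimal sketch that reproduces the example above. It uses greedy longest-match against a tiny made-up vocabulary (`TOY_VOCAB` and `tokenize` are hypothetical names for this post). Real tokenizers such as BPE use learned merge rules instead, so treat this as a simplified stand-in, not how any particular model does it.

```python
# Toy vocabulary: hypothetical pieces chosen for this example only.
TOY_VOCAB = {"I", " love", " chat", "bots"}

def tokenize(text, vocab):
    """Greedy longest-match: at each position, take the longest piece found in vocab."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a known piece is found.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No known piece starts here: emit the single character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("I love chatbots", TOY_VOCAB))
# ['I', ' love', ' chat', 'bots']
```

If " chatbots" were common enough to be in the vocabulary, the same function would return it as a single token, which is why frequent words tend to stay whole.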


How similar words are grouped (Embeddings)

This is the magic part. Once each token gets an ID number, that ID maps to a vector: a list of numbers, typically several hundred to several thousand long (4096 in some models). These vectors are positioned in space such that:

Words                          Relationship
"king" and "queen"             Close together (both royalty)
"cancel" and "disconnect"      Close together (similar intent)
"pizza" and "disconnect"       Far apart (unrelated)

So even though "cancel" (say, ID 12074) and "disconnect" (say, ID 29482) have totally different IDs, their vectors point in similar directions. The IDs themselves are arbitrary labels; the model learns the vectors during training.
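"Pointing in a similar direction" is usually measured with cosine similarity. The sketch below uses invented 3-number vectors (real embeddings are far longer and are learned, not hand-written), and the ID for "pizza" is made up for the example.

```python
import math

# Hypothetical embedding table: token ID -> vector. All values are invented.
EMBEDDINGS = {
    12074: [0.9, 0.1, 0.0],  # "cancel"
    29482: [0.8, 0.2, 0.1],  # "disconnect"
    5001:  [0.0, 0.1, 0.9],  # "pizza" (ID made up for illustration)
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cancel, disconnect, pizza = EMBEDDINGS[12074], EMBEDDINGS[29482], EMBEDDINGS[5001]
print(cosine_similarity(cancel, disconnect))  # close to 1: similar meaning
print(cosine_similarity(pizza, disconnect))   # much lower: unrelated
```

Note that the comparison never looks at the IDs. 12074 and 29482 are only lookup keys; all the "meaning" lives in the vectors.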


What is the vocabulary?

The vocab is simply the complete list of all token pieces the model knows — like a dictionary, but of subword chunks rather than full words.

  • GPT-4's vocab: ~100,000 tokens
  • LLaMA 3's vocab: ~128,000 tokens

If a word isn't in the vocab as-is, it gets split into pieces that are:

Code
"Comcast"       → ["Com", "cast"]         ✅ both in vocab
"unforgettable" → ["un", "forget", "table"] ✅ all in vocab

Nothing is ever truly "unknown": modern tokenizers can fall back to single characters or raw bytes, so any text can always be assembled from smaller known pieces.
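Here is a rough sketch of that fallback idea. The vocabulary, the ID numbering (0 to 255 reserved for raw bytes), and the `encode`/`decode` helpers are all assumptions made for this example. Real tokenizers arrive at their pieces through learned merges, not the greedy matching used here.

```python
# Hypothetical toy vocabulary. IDs 0-255 are reserved for raw bytes,
# so learned pieces start at 256.
PIECES = ["Com", "cast", "un", "forget", "table"]
VOCAB = {piece: 256 + n for n, piece in enumerate(PIECES)}

def encode(text):
    """Greedy longest-match over VOCAB, falling back to UTF-8 byte IDs."""
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            # No piece matches: emit one ID per UTF-8 byte of this character.
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids

def decode(ids):
    """Turn IDs back into text by concatenating the bytes each ID stands for."""
    id_to_bytes = {v: k.encode("utf-8") for k, v in VOCAB.items()}
    raw = b"".join(id_to_bytes[t] if t >= 256 else bytes([t]) for t in ids)
    return raw.decode("utf-8")

print(encode("Comcast"))        # [256, 257]
print(encode("unforgettable"))  # [258, 259, 260]
print(encode("zq"))             # [122, 113]  (byte fallback)
```

Because every possible byte has an ID, `decode(encode(text))` gives back the original text even for words, typos, or emoji the vocabulary has never seen. The cost is that unfamiliar text uses more tokens.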


The LEGO analogy

  • Vocab = your box of LEGO pieces (~100K–150K pieces)
  • Tokenization = breaking a sentence into those pieces
  • Embedding = each piece has a hidden "meaning coordinate" so similar pieces sit near each other
  • The transformer = the brain that reads those coordinates and figures out what you're saying

TL;DR:
Text → split into subword chunks (tokens) → each chunk gets a number (ID) → each ID maps to a meaning-vector → similar meanings get similar vectors → the model works with those vectors.
