An embedding is a list of numbers that represents the meaning of a token. Think of it as GPS coordinates, but instead of 2 dimensions (latitude, longitude), you have thousands of dimensions.
Real GPS (2 dimensions):
New York → [40.71, -74.00]
New Jersey → [40.05, -74.40] ← nearby = geographically similar
Tokyo → [35.68, 139.69] ← far away
Token Embedding (4096 dimensions):
"cancel" → [0.82, -0.14, 0.91, 0.03, -0.67, ... 4096 numbers]
"terminate" → [0.79, -0.11, 0.88, 0.05, -0.64, ... ] ← nearby = similar meaning
"pizza" → [-0.45, 0.72, -0.33, 0.64, 0.21, ... ] ← far away = unrelated
No single one of those 4096 numbers has a guaranteed human-readable meaning, and individual dimensions are usually hard to interpret. Collectively, though, they encode properties such as:
Is this a verb or noun?
Is it positive or negative?
Is it formal or casual?
Is it about food? technology? emotions?
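"Nearby" and "far away" are usually measured with cosine similarity. A minimal sketch, using only the first five illustrative numbers from the example above (these are not real model weights, and a real comparison would use all 4096 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First five illustrative values from the example above (not real model weights).
cancel = [0.82, -0.14, 0.91, 0.03, -0.67]
terminate = [0.79, -0.11, 0.88, 0.05, -0.64]
pizza = [-0.45, 0.72, -0.33, 0.64, 0.21]

sim_close = cosine_similarity(cancel, terminate)
sim_far = cosine_similarity(cancel, pizza)
print(f"cancel vs terminate: {sim_close:.3f}")
print(f"cancel vs pizza:     {sim_far:.3f}")
```

The first score comes out close to 1 (nearly the same direction) and the second is negative (pointing away), which is the numeric version of "nearby" vs. "far away".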
2. How Are Embeddings Learned?
Start: Completely Random
Before training (Day 0):
"king" → [0.52, 0.11, -0.83, 0.44, ...] ← random garbage
"queen" → [-0.31, 0.67, 0.02, -0.91, ...] ← random garbage
"pizza" → [0.19, -0.55, 0.73, 0.28, ...] ← random garbage
At this point the vectors carry no meaning: "king" is no closer to "queen" than to "pizza".
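To see what "no meaning yet" looks like numerically, here is a small sketch. It assumes Gaussian random initialization (the exact scheme varies by model) and a fixed seed so the run is repeatable. In 4096 dimensions, random vectors are almost perpendicular to each other, so every pair of words looks equally unrelated.

```python
import math
import random

DIM = 4096  # embedding width used in the example above
rng = random.Random(0)  # fixed seed so the run is repeatable

def random_embedding(dim):
    """One untrained embedding: independent Gaussian noise in every dimension."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

untrained = {word: random_embedding(DIM) for word in ("king", "queen", "pizza")}

sims = {
    ("king", "queen"): cosine_similarity(untrained["king"], untrained["queen"]),
    ("king", "pizza"): cosine_similarity(untrained["king"], untrained["pizza"]),
    ("queen", "pizza"): cosine_similarity(untrained["queen"], untrained["pizza"]),
}
for pair, sim in sims.items():
    print(pair, round(sim, 3))
```

All three similarities land near zero, so before training "king" has no more in common with "queen" than with "pizza".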
Training: Learn from Context
The model reads an enormous amount of text (trillions of tokens for recent models) and learns:
Sentence 1: "The king ruled the kingdom"
Sentence 2: "The queen ruled the kingdom"
Sentence 3: "I ordered a pizza"
Model notices:
- "king" and "queen" appear in SAME contexts
(before "ruled", after "the", with "kingdom")
- "pizza" appears in COMPLETELY DIFFERENT contexts
(after "ordered", with "toppings", "delivery")
So training pushes:
- "king" vector TOWARD "queen" vector (similar contexts)
- "pizza" vector AWAY from both (different contexts)
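The push and pull can be pictured with a deliberately simplified update rule. This is a cartoon, not the real training objective: actual LLMs adjust embeddings through gradient descent on next-token prediction, and the `move` helper and step sizes below are made up for illustration.

```python
import math

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def move(vec, target, rate):
    """Move vec toward target (rate > 0) or away from it (rate < 0)."""
    return [v + rate * (t - v) for v, t in zip(vec, target)]

# Untrained starting values from the example above.
king = [0.52, 0.11, -0.83, 0.44]
queen = [-0.31, 0.67, 0.02, -0.91]
pizza = [0.19, -0.55, 0.73, 0.28]

before_kq = distance(king, queen)
before_kp = distance(king, pizza)

for _ in range(20):
    new_king = move(king, queen, 0.1)      # shared contexts: pull together
    new_queen = move(queen, king, 0.1)
    new_pizza = move(pizza, king, -0.05)   # different contexts: push apart
    king, queen, pizza = new_king, new_queen, new_pizza

after_kq = distance(king, queen)
after_kp = distance(king, pizza)
print(f"king-queen distance: {before_kq:.2f} -> {after_kq:.2f}")
print(f"king-pizza distance: {before_kp:.2f} -> {after_kp:.2f}")
```

After a few rounds, "king" and "queen" have nearly merged while "pizza" has drifted away, which is the geometric effect the real objective produces indirectly.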
After Billions of Sentences:
After training:
"king" → [0.83, -0.21, 0.67, 0.45, ...]
"queen" → [0.81, -0.18, 0.64, 0.42, ...] ← very close!
"prince" → [0.78, -0.25, 0.61, 0.39, ...] ← also close!
"pizza" → [-0.44, 0.71, -0.28, 0.63, ...] ← far away
The famous example:
king - man + woman ≈ queen (vector arithmetic works approximately; the classic demonstration comes from word2vec-style word embeddings)
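The arithmetic is easy to try on hand-built toy vectors. In the sketch below, the three dimensions (royalty, gender, food) and all values are invented so the effect is visible. Real embeddings have no labeled axes, and the analogy only holds approximately.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 3-D embeddings: [royalty, gender (male=+, female=-), food].
vectors = {
    "king":   [0.9,  0.8, 0.0],
    "queen":  [0.9, -0.8, 0.0],
    "prince": [0.7,  0.8, 0.0],
    "man":    [0.1,  0.8, 0.0],
    "woman":  [0.1, -0.8, 0.0],
    "pizza":  [0.0,  0.0, 1.0],
}

# king - man + woman, computed dimension by dimension.
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest remaining word by cosine similarity (the three input words are excluded).
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))
print(best)  # queen
```

Subtracting "man" removes the male component, adding "woman" puts the female one in, and the royalty component is left untouched.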
3. How Does It Store Names, Proper Nouns, Standard Words?
Standard Words (common):
"the" → Single token, ID 279
Embedding learned from billions of occurrences
Vector captures: "article, neutral, precedes nouns"
"customer" → Single token, ID 12043
Vector captures: "person, business context, receives service"
Names and Rare Words (get split):
"Comcast" → ["Com", "cast"] → IDs [1568, 4384]
Each piece gets its OWN embedding vector
The transformer layers COMBINE them to understand
"this refers to the company Comcast"
"Teradata" → ["Ter", "adata"] → IDs [7321, 18294]
"John" → ["John"] → ID [2782] (common name = single token)
"Kuznetsov" → ["K", "uz", "nets", "ov"] → IDs [42, 3712, 17843, 869]
Key insight: For split words, the meaning isn't in any single token's embedding — it emerges when the transformer layers process them together with attention.
"Com" alone could mean: company, combination, communication, comedy...
"cast" alone could mean: throw, actors, plaster cast...
But "Com" + "cast" in context → the model figures out "it's the telecom company"
(This is the transformer's job, not the embedding's job alone)
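What the embedding layer itself does with a split word is just a table lookup, one row per piece. A minimal sketch, assuming the token IDs from the example above and random stand-in values instead of learned weights (the `vocab`, `embedding_table`, and `embed` names are hypothetical):

```python
import random

EMBED_DIM = 8  # tiny width for readability; the text uses 4096
rng = random.Random(0)

# Hypothetical slice of a vocabulary, using the token IDs from the example above.
vocab = {"Com": 1568, "cast": 4384, "John": 2782}

# Embedding table: one row of numbers per token ID (random stand-ins for learned weights).
embedding_table = {
    token_id: [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for token_id in vocab.values()
}

def embed(pieces):
    """Map subword pieces to IDs, then look up one vector per ID."""
    ids = [vocab[piece] for piece in pieces]
    return ids, [embedding_table[i] for i in ids]

ids, vectors = embed(["Com", "cast"])
print(ids)           # [1568, 4384]
print(len(vectors))  # 2: one vector per piece, none for "Comcast" as a whole
```

There is no row for "Comcast" anywhere in the table. The two separate vectors are all the later layers receive, and combining them is left to attention.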
4. Embedding Sizes Across Models
| Model | Vocab Size | Embedding Dimensions | Total Parameters (embedding layer alone) |
|---|---|---|---|
| GPT-2 | 50,257 | 768 | ~38 million |
| GPT-3 | 50,257 | 12,288 | ~617 million |
| GPT-4 | ~100,000 | ~8,192–12,288 (estimated) | ~800M–1.2B |
| LLaMA 2 | 32,000 | 4,096 | ~131 million |
| LLaMA 3 | 128,256 | 4,096 (8B) / 8,192 (70B) | ~525M / ~1B |
| Mistral 7B | 32,000 | 4,096 | ~131 million |
| BERT (base) | 30,522 | 768 | ~23 million |
Why bigger models have larger embeddings: More dimensions = more nuance. Like describing a person with 5 adjectives vs. 4,096 adjectives — the more dimensions, the more precisely you can differentiate meanings.
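The parameter counts in the table are simply vocab size times embedding dimensions. A short sketch that recomputes them for the models with published figures (GPT-4 is left out because its numbers are only estimates):

```python
# (vocab size, embedding dimensions) as listed in the table above.
models = {
    "GPT-2": (50_257, 768),
    "GPT-3": (50_257, 12_288),
    "LLaMA 2": (32_000, 4_096),
    "LLaMA 3 8B": (128_256, 4_096),
    "LLaMA 3 70B": (128_256, 8_192),
    "Mistral 7B": (32_000, 4_096),
    "BERT (base)": (30_522, 768),
}

def embedding_params(vocab_size, dim):
    """The input embedding matrix has one row of `dim` weights per vocabulary entry."""
    return vocab_size * dim

counts = {name: embedding_params(v, d) for name, (v, d) in models.items()}
for name, n in counts.items():
    print(f"{name:12s} {n / 1e6:8.1f}M")
```

This counts the input embedding matrix only. Some models also keep a separate output projection of the same shape, which would double the figure.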
5. Who Prepares the Initial Vocabulary and How?
The vocabulary is built before training, most often with an algorithm called BPE (Byte Pair Encoding). BERT uses a close relative, WordPiece. Here's how BPE works:
Step-by-Step BPE Example:
Training corpus (simplified):
"low low low low low"
"lower lower"
"newest newest newest"
"widest"
Step 1: Start with individual characters as vocab
Vocab: {l, o, w, e, r, n, s, t, d, i, ...}
Step 2: Count which pairs of characters appear most often
"lo" appears 7 times (5 in "low" + 2 in "lower")
"ow" appears 7 times
"ne" appears 3 times
"we" appears 5 times (2 in "lower" + 3 in "newest")
...
Step 3: Merge the most frequent pair → "lo" becomes one token ("lo" and "ow" are tied at 7; the tie-break is arbitrary)
Vocab: {l, o, w, e, r, n, s, t, d, i, ..., "lo"}
Step 4: Recount pairs with the merged token
"lo"+"w" appears 7 times → merge!
Vocab: {..., "lo", "low"}
Step 5: Keep merging until you reach the desired vocab size (later merges build on earlier ones, e.g. once "er" and "est" exist):
"low" + "er" → "lower"
"new" + "est" → "newest"
...
Final vocab (after roughly 128,000 merges):
Single characters: a, b, c, ...
Common pieces: "ing", "tion", "un", "re", ...
Full common words: "the", "and", "customer", ...
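The merge loop above fits in a few lines of Python. This is a minimal sketch of character-level BPE training on the same toy corpus. Production tokenizers work on bytes, handle spaces and special tokens, and define their own tie-breaking, none of which is modeled here.

```python
from collections import Counter

# Word frequencies from the simplified corpus above.
word_freqs = {"low": 5, "lower": 2, "newest": 3, "widest": 1}

def count_pairs(splits):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for word, symbols in splits.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += word_freqs[word]
    return pairs

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(num_merges):
    splits = {word: list(word) for word in word_freqs}  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(splits)
        if not pairs:
            break
        # Ties go to the pair counted first; real tokenizers define their own tie-break.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        splits = {word: merge_pair(symbols, best) for word, symbols in splits.items()}
    return merges, splits

initial_pairs = count_pairs({word: list(word) for word in word_freqs})
merges, splits = train_bpe(num_merges=4)
print(initial_pairs.most_common(3))
print(merges)
print(splits)
```

On this corpus the first two merges are ("l", "o") and then ("lo", "w"), matching Steps 3 and 4. Which pairs get merged after that depends entirely on the counts, so a different corpus yields a different vocabulary.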
Who does this?
The model creators (Meta, OpenAI, Google) run BPE on their training corpus once before training starts. It's a preprocessing step.
Meta (LLaMA 3):
1. Collected training text (web, books, code...)
2. Ran BPE algorithm on that text
3. Chose 128,256 as target vocab size
4. Output: tokenizer.model file (the vocab)
5. THEN started training the actual LLM
6. The Big Question: "English Didn't Change, So Why Do Vocabs Differ?"
Great intuition — but here's why vocabs do change between models:
Reason 1: Different Training Data
LLaMA 2 (trained 2023):
- Mostly English text, books, web
- Less code
- Result: 32,000 tokens, optimized for English
LLaMA 3 (trained 2024):
- Much more code (Python, JS, etc.)
- More multilingual (Hindi, Chinese, etc.)
- Result: 128,256 tokens
New tokens added for code:
"def", "self", "import", "    " (4 spaces = 1 token for indentation!)
New tokens added for other languages:
"的", "है", "это" (common Chinese/Hindi/Russian pieces)
Reason 2: Different Vocab Sizes = Different Splits
Small vocab (32,000 tokens):
"embeddings" → ["embed", "dings"] (2 tokens)
"transformer" → ["trans", "former"] (2 tokens)
Large vocab (128,000 tokens):
"embeddings" → ["embeddings"] (1 token!)
"transformer" → ["transformer"] (1 token!)
Bigger vocab = fewer tokens per sentence = faster processing
But bigger vocab = larger embedding matrix = more memory
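Both sides of that tradeoff can be put into rough numbers. The sketch below assumes 16-bit weights and a 4,096-wide embedding, and it reuses the illustrative splits from the example above (they are not output from a real tokenizer).

```python
BYTES_PER_WEIGHT = 2  # assumption: 16-bit weights (fp16/bf16)
EMBED_DIM = 4096

def embedding_matrix_bytes(vocab_size, dim=EMBED_DIM, bytes_per_weight=BYTES_PER_WEIGHT):
    """Memory for the embedding matrix: one `dim`-wide row per vocabulary entry."""
    return vocab_size * dim * bytes_per_weight

small = embedding_matrix_bytes(32_000)
large = embedding_matrix_bytes(128_256)
print(f"32,000-token vocab:  {small / 1e6:.0f} MB")
print(f"128,256-token vocab: {large / 1e6:.0f} MB")

# Illustrative splits from the example above (not real tokenizer output).
small_vocab_split = {"embeddings": ["embed", "dings"], "transformer": ["trans", "former"]}
large_vocab_split = {"embeddings": ["embeddings"], "transformer": ["transformer"]}

tokens_small = sum(len(pieces) for pieces in small_vocab_split.values())
tokens_large = sum(len(pieces) for pieces in large_vocab_split.values())
print(f"tokens needed: {tokens_small} (small vocab) vs {tokens_large} (large vocab)")
```

Under these assumptions the larger vocabulary costs about four times the embedding memory, in exchange for representing the same two words in half as many tokens.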
Reason 3: Different Optimization Goals
GPT-4 (OpenAI):
- Optimized for conversation, reasoning
- Kept common English phrases as single tokens
- " cannot" = 1 token (note the space is included!)
Code-focused tokenizers (trained on code-heavy corpora):
- Optimized for programming
- "function", "return", "async" = single tokens
- "    " (indentation) = single token
| Question | Answer |
|---|---|
| How are embeddings learned? | By observing which words appear in similar contexts across billions of sentences |
| Names/rare words? | Split into subword pieces; transformer combines them for meaning |
| Who makes the vocab? | Model creators, using BPE on training data, before training starts |
| Does vocab change across models? | Yes: different data, different sizes, different goals |
| Why change if English is the same? | More languages, more code, efficiency tradeoffs, and different corpus compositions |
The analogy:
The English alphabet hasn't changed, but dictionaries differ (Oxford vs. Webster vs. Urban Dictionary). Each is optimized for a different audience. The same goes for LLM vocabularies: same language, different "dictionaries" optimized for different jobs.