Tuesday, June 2, 2026

Embeddings — Deep Dive

1. What IS an Embedding?

An embedding is a list of numbers that represents the meaning of a token. Think of it as GPS coordinates, but instead of 2 dimensions (latitude, longitude), you have thousands of dimensions.

Code
Real GPS (2 dimensions):
  New York  → [40.71, -74.00]
  New Jersey → [40.05, -74.40]   ← nearby = geographically similar
  Tokyo     → [35.68, 139.69]   ← far away

Token Embedding (4096 dimensions):
  "cancel"    → [0.82, -0.14, 0.91, 0.03, -0.67, ... 4096 numbers]
  "terminate" → [0.79, -0.11, 0.88, 0.05, -0.64, ... ]  ← nearby = similar meaning
  "pizza"     → [-0.45, 0.72, -0.33, 0.64, 0.21, ... ]  ← far away = unrelated

Each of those 4096 numbers captures some aspect of meaning. We usually can't say what any single dimension means, but collectively they encode things like:

  • Is this a verb or noun?
  • Is it positive or negative?
  • Is it formal or casual?
  • Is it about food? technology? emotions?
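
To make "nearby" concrete, here is a minimal sketch that measures closeness with cosine similarity, one common choice for comparing embeddings. It uses only the first five illustrative numbers from the example above, so the vectors are made-up values, not real model weights.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First five illustrative values from the example above (not real weights).
cancel    = [0.82, -0.14, 0.91, 0.03, -0.67]
terminate = [0.79, -0.11, 0.88, 0.05, -0.64]
pizza     = [-0.45, 0.72, -0.33, 0.64, 0.21]

sim_close = cosine_similarity(cancel, terminate)
sim_far = cosine_similarity(cancel, pizza)
print(f"cancel vs terminate: {sim_close:.3f}")  # close to 1
print(f"cancel vs pizza:     {sim_far:.3f}")    # negative
```

A score near 1 means the two vectors point the same way; a score near 0 or below means the words have little in common.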

2. How Are Embeddings Learned?

Start: Completely Random

Code
Before training (Day 0):
  "king"   → [0.52, 0.11, -0.83, 0.44, ...]   ← random garbage
  "queen"  → [-0.31, 0.67, 0.02, -0.91, ...]   ← random garbage
  "pizza"  → [0.19, -0.55, 0.73, 0.28, ...]    ← random garbage

  Distances between words are arbitrary: no meaning yet.

Training: Learn from Context

The model reads billions of sentences and learns:

Code
Sentence 1: "The king ruled the kingdom"
Sentence 2: "The queen ruled the kingdom"
Sentence 3: "I ordered a pizza"

Model notices:
  - "king" and "queen" appear in SAME contexts 
    (before "ruled", after "the", with "kingdom")
  - "pizza" appears in COMPLETELY DIFFERENT contexts 
    (after "ordered", with "toppings", "delivery")

So training pushes:
  - "king" vector TOWARD "queen" vector (similar contexts)
  - "pizza" vector AWAY from both (different contexts)
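
The "same contexts" idea can be sketched with plain counting. This is a stand-in, not how an LLM actually trains (real embeddings are learned by gradient descent while predicting tokens), but it shows why shared contexts pull "king" and "queen" together. The sentences are lowercased versions of the three above.

```python
import math
from collections import Counter, defaultdict

sentences = [
    "the king ruled the kingdom",
    "the queen ruled the kingdom",
    "i ordered a pizza",
]

# For every word, count the other words that share a sentence with it.
context_counts = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for i, word in enumerate(words):
        for j, neighbor in enumerate(words):
            if i != j:
                context_counts[word][neighbor] += 1

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

king_queen = cosine(context_counts["king"], context_counts["queen"])
king_pizza = cosine(context_counts["king"], context_counts["pizza"])
print(f"king vs queen: {king_queen:.2f}")
print(f"king vs pizza: {king_pizza:.2f}")
```

"king" and "queen" end up with identical context counts, while "king" and "pizza" share no context words at all.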

After Billions of Sentences:

Code
After training:
  "king"   → [0.83, -0.21, 0.67, 0.45, ...]
  "queen"  → [0.81, -0.18, 0.64, 0.42, ...]   ← very close!
  "prince" → [0.78, -0.25, 0.61, 0.39, ...]   ← also close!
  "pizza"  → [-0.44, 0.71, -0.28, 0.63, ...]  ← far away

The famous example:

Code
king - man + woman ≈ queen   (approximately: "queen" is usually the nearest vector to the result, a pattern famously shown with word2vec embeddings)
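
Here is a small sketch of that arithmetic. The 2-D vectors are hand-made for illustration (axis 0 loosely means "royalty", axis 1 loosely means "gender"); real embeddings have no such clean axes, and the helper name `analogy` is hypothetical.

```python
import math

# Hand-made 2-D vectors (hypothetical values, chosen for illustration).
vectors = {
    "king":   [0.9,  0.8],
    "queen":  [0.9, -0.8],
    "prince": [0.7,  0.8],
    "man":    [0.1,  0.9],
    "woman":  [0.1, -0.9],
    "pizza":  [-0.8, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")
print(result)  # queen
```

Excluding the input words from the candidates is the usual convention; without it, the nearest vector is often just one of the inputs.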

3. How Does It Store Names, Proper Nouns, Standard Words?

Standard Words (common):

Code
"the"      → Single token, ID 279
              Embedding learned from billions of occurrences
              Vector captures: "article, neutral, precedes nouns"

"customer" → Single token, ID 12043  
              Vector captures: "person, business context, receives service"

Names and Rare Words (get split):

Code
"Comcast"    → ["Com", "cast"] → IDs [1568, 4384]
               Each piece gets its OWN embedding vector
               The transformer layers COMBINE them to understand 
               "this refers to the company Comcast"

"Teradata"   → ["Ter", "adata"] → IDs [7321, 18294]

"John"       → ["John"] → ID [2782]  (common name = single token)

"Kuznetsov"  → ["K", "uz", "nets", "ov"] → IDs [42, 3712, 17843, 869]

Key insight: For split words, the meaning isn't in any single token's embedding — it emerges when the transformer layers process them together with attention.

Code
"Com" alone could mean: company, combination, communication, comedy...
"cast" alone could mean: throw, actors, plaster cast...

But "Com" + "cast" in context → the model figures out "it's the telecom company"
(This is the transformer's job, not the embedding's job alone)
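
A rough sketch of the lookup step, assuming the illustrative token IDs above and a tiny 4-dimensional table filled with random numbers. A real model stores the table as one dense matrix with a row for every ID; the dictionary here is only a convenience.

```python
import random

random.seed(0)  # reproducible toy values

EMBED_DIM = 4  # real models use hundreds to thousands of dimensions

# Token IDs taken from the illustrative examples above.
token_ids = {"Com": 1568, "cast": 4384, "John": 2782}

# Toy embedding table: one row of EMBED_DIM numbers per token ID.
embedding_table = {
    tid: [round(random.uniform(-1, 1), 2) for _ in range(EMBED_DIM)]
    for tid in token_ids.values()
}

def embed(pieces):
    """Look up one vector per subword piece; no mixing happens here."""
    return [embedding_table[token_ids[p]] for p in pieces]

comcast_vectors = embed(["Com", "cast"])
john_vectors = embed(["John"])
print(len(comcast_vectors), len(john_vectors))  # 2 1
```

"Comcast" comes out as two separate vectors. Nothing in this step knows they belong together; combining them is left to the attention layers.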

4. Embedding Sizes Across Models

Model          Vocab Size  Embedding Dimensions       Embedding-Layer Parameters
-------------  ----------  -------------------------  --------------------------
GPT-2 (small)  50,257      768                        ~39 million
GPT-3          50,257      12,288                     ~617 million
GPT-4          ~100,000    ~8,192–12,288 (estimated)  ~800M–1.2B
LLaMA 2 (7B)   32,000      4,096                      ~131 million
LLaMA 3        128,256     4,096 (8B) / 8,192 (70B)   ~525M / ~1B
Mistral 7B     32,000      4,096                      ~131 million
BERT (base)    30,522      768                        ~23 million
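
The parameter counts in the last column are just vocab size times embedding dimensions. A quick sketch to check a few rows (the function name `embedding_params` is made up for this example):

```python
def embedding_params(vocab_size, embed_dim):
    """Parameters in an embedding table: one embed_dim-sized row per token."""
    return vocab_size * embed_dim

models = {
    "GPT-2 (small)": (50_257, 768),
    "GPT-3": (50_257, 12_288),
    "LLaMA 2 7B": (32_000, 4_096),
    "LLaMA 3 8B": (128_256, 4_096),
    "BERT (base)": (30_522, 768),
}

for name, (vocab, dim) in models.items():
    print(f"{name:14s} {embedding_params(vocab, dim) / 1e6:8.1f}M")
```

This counts only the input embedding table. Some models reuse (tie) it as the output layer and others keep a second matrix of the same size, which this sketch ignores.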

Why bigger models have larger embeddings: More dimensions = more nuance. Like describing a person with 5 adjectives vs. 4,096 adjectives — the more dimensions, the more precisely you can differentiate meanings.


5. Who Prepares the Initial Vocabulary and How?

The vocabulary is built before training, most often with an algorithm called BPE (Byte Pair Encoding); BERT, for example, uses the closely related WordPiece. Here's how BPE works:

Step-by-Step BPE Example:

Code
Training corpus (simplified):
  "low low low low low"
  "lower lower"  
  "newest newest newest"
  "widest"

Step 1: Start with individual characters as vocab
  Vocab: {l, o, w, e, r, n, s, t, d, i, ...}

Step 2: Count which pairs of adjacent characters appear most often
  "lo" appears 7 times (5 in "low" + 2 in "lower")
  "ow" appears 7 times
  "we" appears 5 times (2 in "lower" + 3 in "newest")
  "ne" appears 3 times
  ...

Step 3: Merge the most frequent pair ("lo" and "ow" tie at 7; pick "lo") → "lo" becomes one token
  Vocab: {l, o, w, e, r, n, s, t, d, i, ..., "lo"}

Step 4: Recount pairs with the merged token
  "lo"+"w" appears 7 times → merge!
  Vocab: {..., "lo", "low"}

Step 5: Keep merging until you reach desired vocab size
  "low" + "er" → "lower"
  "new" + "est" → "newest"
  ...

Final vocab (after roughly 128,000 merges):
  Single characters: a, b, c, ...
  Common pieces: "ing", "tion", "un", "re", ...
  Full common words: "the", "and", "customer", ...
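
The steps above can be sketched in a few lines of Python. This is a bare-bones version of the merge loop run on the same toy corpus; production tokenizers add details it leaves out (byte-level fallback, pre-tokenization rules, special tokens), and the tie-breaking rule here (first pair seen wins) is an assumption.

```python
from collections import Counter

# Word frequencies from the simplified corpus above.
corpus = {"low": 5, "lower": 2, "newest": 3, "widest": 1}

# Represent each word as a tuple of symbols, starting from single characters.
words = {tuple(word): freq for word, freq in corpus.items()}

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

merges = []
for _ in range(2):  # two merges are enough to build "low"
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # ties go to the pair seen first
    merges.append(best)
    words = merge_pair(words, best)

print(merges)  # [('l', 'o'), ('lo', 'w')]
```

Raising the loop count keeps merging until the vocabulary reaches whatever size you want, which is exactly the knob model creators turn.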
  

Who does this?

The model creators (Meta, OpenAI, Google) run BPE on their training corpus once before training starts. It's a preprocessing step.

Code
Meta (LLaMA 3):
  1. Collected training text (web, books, code...)
  2. Built a BPE vocabulary (reportedly 100K tokens from the tiktoken tokenizer plus 28K added for non-English text)
  3. Chose 128,256 as target vocab size
  4. Output: tokenizer.model file (the vocab)
  5. THEN started training the actual LLM

6. The Big Question: "English Didn't Change, So Why Do Vocabs Differ?"

Great intuition — but here's why vocabs do change between models:

Reason 1: Different Training Data

Code
LLaMA 2 (released 2023):
  - Mostly English text, books, web
  - Less code
  - Result: 32,000 tokens, optimized for English

LLaMA 3 (released 2024):
  - Much more code (Python, JS, etc.)
  - More multilingual (Hindi, Chinese, etc.)
  - Result: 128,256 tokens

New tokens added for code:
  "def", "self", "import", "    " (4 spaces = 1 token for indentation!)
  
New tokens added for other languages:
  "的", "है", "это" (common Chinese/Hindi/Russian pieces)

Reason 2: Different Vocab Sizes = Different Splits

Code
Small vocab (32,000 tokens):
  "embeddings" → ["embed", "dings"]     (2 tokens)
  "transformer" → ["trans", "former"]    (2 tokens)
  
Large vocab (128,000 tokens):
  "embeddings" → ["embeddings"]          (1 token!)  
  "transformer" → ["transformer"]        (1 token!)
  
Bigger vocab = fewer tokens per sentence = faster processing
But bigger vocab = larger embedding matrix = more memory
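
To see the effect of vocab size on splits, here is a sketch using greedy longest-match tokenization. That is a simplification: real BPE applies its learned merges in order rather than grabbing the longest match. Both vocabularies are hypothetical and contain only the pieces from the example above.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocab entry that matches."""
    tokens, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabularies mirroring the splits shown above.
small_vocab = {"embed", "dings", "trans", "former"}
large_vocab = small_vocab | {"embeddings", "transformer"}

for word in ["embeddings", "transformer"]:
    small = greedy_tokenize(word, small_vocab)
    large = greedy_tokenize(word, large_vocab)
    print(f"{word}: {small} vs {large}")
```

The same text costs two tokens under the small vocab and one under the large one, which is where the speed difference comes from.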

Reason 3: Different Optimization Goals

Code
GPT-4 (OpenAI):
  - Optimized for conversation, reasoning
  - Kept common English phrases as single tokens
  - " cannot" = 1 token (note the space is included!)

Code-focused tokenizers (trained on code-heavy corpora):
  - Optimized for programming
  - "function", "return", "async" = single tokens
  - "    " (indentation) = single token

Reason 4: Efficiency Tradeoffs

Code
The tradeoff:
┌──────────────────────────────────────────────────────────┐
│  Smaller Vocab (32K)           Larger Vocab (128K)       │
│  ───────────────────           ───────────────────       │
│  ✅ Less memory                ✅ Fewer tokens/sentence  │
│  ✅ Smaller embedding table    ✅ Faster inference       │
│  ❌ More tokens per sentence   ❌ Huge embedding table   │
│  ❌ Slower inference           ❌ More memory            │
└──────────────────────────────────────────────────────────┘

Summary Table

Question                            Answer
----------------------------------  ------------------------------------------------
What is an embedding?               A vector of numbers that encodes meaning
How is it learned?                  By observing which words appear in similar contexts across billions of sentences
Names/rare words?                   Split into subword pieces; transformer combines them for meaning
Who makes the vocab?                Model creators, using BPE on training data, before training starts
Does vocab change across models?    Yes: different data, different sizes, different goals
Why change if English is the same?  More languages, more code, efficiency tradeoffs, and different corpus compositions

The analogy:

The English alphabet hasn't changed, but dictionaries differ (Oxford vs. Webster vs. Urban Dictionary). Each is optimized for a different audience. LLM vocabularies work the same way: same language, different "dictionaries" optimized for different jobs.
