Tuesday, June 2, 2026

The Transformer — High Level

 


The Core Idea

A transformer is a neural network architecture that reads all tokens at once (not one by one like older models) and figures out which tokens should pay attention to which other tokens to understand meaning.


The Analogy

Imagine you're reading this sentence:

Code
"The bank was steep, so I didn't jump into the river"

When you hit the word "bank," you can't settle its meaning yet. Once you reach "river" later in the sentence, your brain links the two and decides it means "riverbank," not "financial bank." A transformer does the same thing, but mathematically.


The Architecture (Simplified)

Code
Input: "Customer wants to cancel service"
         │
         ▼
┌─────────────────────────┐
│  1. TOKENIZATION        │  → [12043, 6592, 311, 12074, 2532]
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  2. EMBEDDING LAYER     │  → Each ID becomes a 4096-dim vector
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  3. POSITIONAL ENCODING │  → Adds "I'm the 1st/2nd/3rd word" info
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  4. TRANSFORMER BLOCKS  │  ← This is the magic (repeated 32-96 times)
│     ┌─────────────────┐ │
│     │ Self-Attention   │ │  → "Which other words matter for THIS word?"
│     └─────────────────┘ │
│     ┌─────────────────┐ │
│     │ Feed-Forward NN  │ │  → "Now process what I learned"
│     └─────────────────┘ │
│     (repeat N times)     │
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  5. OUTPUT HEAD         │  → Predicts the next token (probability over vocab)
└─────────────────────────┘
         │
         ▼
Output: "cancellation" (most probable next token)
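Steps 2 and 3 of the diagram can be sketched in a few lines. This is a toy version under stated assumptions: the embedding values are random stand-ins (a real model learns them), the width is 8 instead of 4096, and the sinusoidal formula is the one from the original Transformer design, chosen here only for illustration (many recent models use other position schemes). The names `sinusoidal_position`, `embedding_table`, and `block_input` are made up for this sketch.

```python
import math
import random

def sinusoidal_position(pos, dim):
    """Sinusoidal position vector for one position (dim must be even)."""
    vec = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec

random.seed(0)  # fixed seed so the random "embeddings" are reproducible
dim = 8  # toy width; the diagram above uses 4096
token_ids = [12043, 6592, 311, 12074, 2532]  # IDs from the diagram

# Step 2: one vector per token ID (random here, learned in a real model).
embedding_table = {tid: [random.uniform(-1, 1) for _ in range(dim)] for tid in token_ids}

# Step 3: add the position vector element-wise to each token's embedding.
block_input = [
    [e + p for e, p in zip(embedding_table[tid], sinusoidal_position(pos, dim))]
    for pos, tid in enumerate(token_ids)
]
print(len(block_input), "tokens x", len(block_input[0]), "dims")
```

The result is one vector per token that carries both "what the token is" and "where it sits", which is what the transformer blocks receive.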

How Attention Works (The Key Innovation)

Code
Sentence: "The cat sat on the mat because it was tired"

Question the model asks: What does "it" refer to?

Attention scores for "it":
  "The"     → 0.02  (low)
  "cat"     → 0.71  (HIGH — "it" = the cat!)
  "sat"     → 0.04  (low)
  "on"      → 0.01  (low)
  "the"     → 0.01  (low)
  "mat"     → 0.18  (some attention — could be the mat)
  "because" → 0.03  (low)

The model learns to "attend" to the right words to build understanding.
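A minimal sketch of where scores like these come from, assuming the standard scaled dot-product form: each earlier token has a "key" vector, the current token has a "query" vector, and a softmax over their dot products gives weights that sum to 1. The 2-dimensional vectors below are invented for illustration; real models learn them and use far more dimensions.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights for a single query vector."""
    scale = math.sqrt(len(query))
    raw = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(raw)

# Toy vectors, made up for this example only.
tokens = ["The", "cat", "sat", "mat"]
keys = [[0.1, 0.0], [2.0, 1.5], [0.2, 0.1], [1.0, 0.8]]
query_it = [2.0, 1.0]  # hypothetical query vector for "it"

weights = attention_weights(query_it, keys)
for tok, w in zip(tokens, weights):
    print(f"{tok:>4} -> {w:.2f}")
```

With these made-up vectors, "cat" gets the largest weight and "mat" the second largest, the same shape as the scores listed above.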

Multi-Head Attention — Why "Multi"?

One attention head might focus on grammar (subject-verb agreement).
Another head might focus on meaning (what does "it" refer to?).
Another might focus on position (what's nearby?).

Code
Head 1 (grammar):     "it" attends to "cat" (noun agreement)
Head 2 (coreference): "it" attends to "cat" (pronoun resolution)  
Head 3 (proximity):   "it" attends to "because" (adjacent preceding word)

All heads combined → rich understanding of each token's role

A model with 32 heads runs 32 attention patterns in parallel in every layer, then combines them.
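Here is a rough sketch of the "split, attend, combine" idea. It is simplified on purpose: real models apply learned projection matrices before and after the heads, and this version skips them and just slices each token vector into equal parts. The function names and the input matrix are assumptions made for the example.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def single_head(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

def multi_head(x, n_heads):
    """Slice each token vector into n_heads parts, attend per part, concatenate."""
    head_dim = len(x[0]) // n_heads
    per_head = []
    for h in range(n_heads):
        part = [row[h * head_dim:(h + 1) * head_dim] for row in x]
        per_head.append(single_head(part, part, part))  # no learned projections here
    # Glue the head outputs back together so the width matches the input.
    return [sum((per_head[h][t] for h in range(n_heads)), []) for t in range(len(x))]

x = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.1, 0.9], [1.0, 1.0, 0.3, 0.3]]  # 3 tokens, width 4
out = multi_head(x, n_heads=2)
print(len(out), "tokens x", len(out[0]), "dims")
```

Each head only sees its own slice, so different heads can end up with different attention patterns over the same tokens.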


Why Repeat Layers?

Code
Layer 1-4:   Low-level patterns (grammar, syntax, word boundaries)
Layer 5-16:  Mid-level patterns (phrases, entities, relationships)
Layer 17-32: High-level patterns (reasoning, tone, intent, world knowledge)

Early layers:  "cancel" is a verb, "service" is a noun
Middle layers: "cancel service" is a customer action
Deep layers:   "This person is unhappy and wants to churn"

Each layer refines the understanding built by the previous layer.


Summary of Each Component (Few Lines)

Component              What It Does
─────────              ────────────
Tokenizer              Splits text into subword pieces, assigns IDs
Embedding              Converts IDs into meaning-vectors
Positional Encoding    Tells the model word ORDER (since attention has no inherent sequence)
Self-Attention         Each token asks "which other tokens are relevant to me?" and pulls info from them
Feed-Forward Network   Processes the attention output: adds non-linearity, stores factual knowledge
Layer Norm             Stabilizes numbers between layers so training doesn't explode
Residual Connections   Adds the input back to the output of each block (prevents "forgetting" earlier info)
Output Head (LM Head)  Maps final vectors back to vocab-sized probabilities → picks the next token
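The Layer Norm and Residual Connections entries are easier to see in code. This is a sketch only: it assumes the "pre-norm" ordering (normalize, run the sublayer, add the input back), leaves out the learned gain and bias that real layer norm has, and uses a made-up `toy_feed_forward` in place of a trained network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a vector to mean 0 and variance ~1 (learned gain/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual(x, sublayer):
    """Pre-norm residual step: normalize, apply the sublayer, add the input back."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_feed_forward(x):
    # Stand-in for a learned feed-forward network: ReLU, then scale by 0.5.
    return [0.5 * max(0.0, v) for v in x]

x = [2.0, -1.0, 0.5, 4.0]
for _ in range(3):  # three stacked "blocks"
    x = residual(x, toy_feed_forward)
print(x)
```

Because the input is added back at every step, information from earlier layers is carried forward even when a sublayer contributes little.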

One-liner: A transformer is a stack of attention and feed-forward layers that lets every word look at every other word to figure out meaning, repeated dozens of times until the model deeply "understands" the input and can predict what comes next.

Embeddings — Deep Dive

1. What IS an Embedding?

An embedding is a list of numbers that represents the meaning of a token. Think of it as GPS coordinates, but instead of 2 dimensions (latitude, longitude), you have thousands of dimensions.

Code
Real GPS (2 dimensions):
  New York  → [40.71, -74.00]
  New Jersey → [40.05, -74.40]   ← nearby = geographically similar
  Tokyo     → [35.68, 139.69]   ← far away

Token Embedding (4096 dimensions):
  "cancel"    → [0.82, -0.14, 0.91, 0.03, -0.67, ... 4096 numbers]
  "terminate" → [0.79, -0.11, 0.88, 0.05, -0.64, ... ]  ← nearby = similar meaning
  "pizza"     → [-0.45, 0.72, -0.33, 0.64, 0.21, ... ]  ← far away = unrelated

Each of those 4096 numbers captures some aspect of meaning — we don't always know what each dimension means, but collectively they encode:

  • Is this a verb or noun?
  • Is it positive or negative?
  • Is it formal or casual?
  • Is it about food? technology? emotions?
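"Nearby" and "far away" are usually measured with cosine similarity. The sketch below applies it to the first five numbers of the illustrative vectors above. Those numbers are examples from these notes, not values from any real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First five dimensions of the illustrative vectors above.
cancel = [0.82, -0.14, 0.91, 0.03, -0.67]
terminate = [0.79, -0.11, 0.88, 0.05, -0.64]
pizza = [-0.45, 0.72, -0.33, 0.64, 0.21]

print("cancel vs terminate:", round(cosine_similarity(cancel, terminate), 3))
print("cancel vs pizza:    ", round(cosine_similarity(cancel, pizza), 3))
```

The first score comes out very close to 1 (similar meaning) and the second is negative (unrelated).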

2. How Are Embeddings Learned?

Start: Completely Random

Code
Before training (Day 0):
  "king"   → [0.52, 0.11, -0.83, 0.44, ...]   ← random garbage
  "queen"  → [-0.31, 0.67, 0.02, -0.91, ...]   ← random garbage
  "pizza"  → [0.19, -0.55, 0.73, 0.28, ...]    ← random garbage

  All words are equally "far apart" — no meaning yet.

Training: Learn from Context

The model reads billions of sentences (trillions of tokens) and learns:

Code
Sentence 1: "The king ruled the kingdom"
Sentence 2: "The queen ruled the kingdom"
Sentence 3: "I ordered a pizza"

Model notices:
  - "king" and "queen" appear in SAME contexts 
    (before "ruled", after "the", with "kingdom")
  - "pizza" appears in COMPLETELY DIFFERENT contexts 
    (after "ordered", with "toppings", "delivery")

So training pushes:
  - "king" vector TOWARD "queen" vector (similar contexts)
  - "pizza" vector AWAY from both (different contexts)
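The "same contexts" observation can be checked by hand with a simple count. This sketch is not how an LLM trains (that uses gradient descent on next-token prediction). It only shows the signal training picks up on. The window size and the Jaccard overlap measure are choices made for this example.

```python
from collections import defaultdict

def context_sets(sentences, window=2):
    """For each word, collect the words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            contexts[word].update(words[lo:i] + words[i + 1:hi])
    return contexts

def jaccard(a, b):
    """Overlap between two sets: 1.0 = identical, 0.0 = nothing shared."""
    return len(a & b) / len(a | b)

sentences = [
    "The king ruled the kingdom",
    "The queen ruled the kingdom",
    "I ordered a pizza",
]
ctx = context_sets(sentences)
print("king vs queen:", jaccard(ctx["king"], ctx["queen"]))
print("king vs pizza:", jaccard(ctx["king"], ctx["pizza"]))
```

On these three sentences "king" and "queen" share all of their context words, while "king" and "pizza" share none.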

After Billions of Sentences:

Code
After training:
  "king"   → [0.83, -0.21, 0.67, 0.45, ...]
  "queen"  → [0.81, -0.18, 0.64, 0.42, ...]   ← very close!
  "prince" → [0.78, -0.25, 0.61, 0.39, ...]   ← also close!
  "pizza"  → [-0.44, 0.71, -0.28, 0.63, ...]  ← far away

The famous example:

Code
king - man + woman ≈ queen   (holds approximately in word2vec-style embeddings)
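To see the mechanics, here is the arithmetic on hand-built 2-D vectors where the first number means "royalty" and the second "gender". These vectors are invented so the example works out cleanly. Learned embeddings only show this pattern approximately.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hand-built toy vectors: [royalty, gender]. Real embeddings are learned, not designed.
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "pizza": [-1.0, 0.0],
}

# king - man + woman, element by element.
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# As is usual for analogy tests, the three input words are excluded from the search.
candidates = {w: v for w, v in vectors.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # → queen
```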

3. How Does It Store Names, Proper Nouns, Standard Words?

Standard Words (common):

Code
"the"      → Single token, ID 279
              Embedding learned from billions of occurrences
              Vector captures: "article, neutral, precedes nouns"

"customer" → Single token, ID 12043  
              Vector captures: "person, business context, receives service"

Names and Rare Words (get split):

Code
"Comcast"    → ["Com", "cast"] → IDs [1568, 4384]
               Each piece gets its OWN embedding vector
               The transformer layers COMBINE them to understand 
               "this refers to the company Comcast"

"Teradata"   → ["Ter", "adata"] → IDs [7321, 18294]

"John"       → ["John"] → ID [2782]  (common name = single token)

"Kuznetsov"  → ["K", "uz", "nets", "ov"] → IDs [42, 3712, 17843, 869]

Key insight: For split words, the meaning isn't in any single token's embedding — it emerges when the transformer layers process them together with attention.

Code
"Com" alone could mean: company, combination, communication, comedy...
"cast" alone could mean: throw, actors, plaster cast...

But "Com" + "cast" in context → the model figures out "it's the telecom company"
(This is the transformer's job, not the embedding's job alone)
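A toy splitter makes the "get split" behavior concrete. Assumptions: the vocab below contains only the pieces and IDs used as examples in these notes, and the greedy longest-match rule is a simplification (real BPE tokenizers replay their learned merges in order instead). `split_into_pieces` is a hypothetical helper, not a library function.

```python
def split_into_pieces(word, vocab):
    """Greedy longest-match split of a word into known vocab pieces."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"no vocab piece covers {word[start:]!r}")
    return pieces

# Pieces and IDs copied from the examples above (illustrative, not a real vocab).
vocab = {"Com": 1568, "cast": 4384, "John": 2782,
         "K": 42, "uz": 3712, "nets": 17843, "ov": 869}

for word in ["Comcast", "John", "Kuznetsov"]:
    pieces = split_into_pieces(word, vocab)
    print(word, "->", pieces, "->", [vocab[p] for p in pieces])
```

Each piece then gets its own embedding lookup, and the transformer layers do the work of recognizing that the pieces belong together.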

4. Embedding Sizes Across Models

Model          Vocab Size  Embedding Dimensions        Embedding-Layer Parameters
─────          ──────────  ────────────────────        ──────────────────────────
GPT-2 (small)  50,257      768                         ~39 million
GPT-3 (175B)   50,257      12,288                      ~617 million
GPT-4          ~100,000    ~8,192–12,288 (estimated)   ~800M–1.2B (estimated)
LLaMA 2 (7B)   32,000      4,096                       ~131 million
LLaMA 3        128,256     4,096 (8B) / 8,192 (70B)    ~525M / ~1B
Mistral 7B     32,000      4,096                       ~131 million
BERT (base)    30,522      768                         ~23 million
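The last column is just vocab size times embedding dimension, since the embedding table holds one vector per token. A quick check, using only the rows with published numbers (GPT-4 is left out because its figures are estimates):

```python
# (vocab size, embedding dimension) pairs taken from the table above.
models = {
    "GPT-2 (small)": (50_257, 768),
    "GPT-3 (175B)": (50_257, 12_288),
    "LLaMA 2 (7B)": (32_000, 4_096),
    "LLaMA 3 (8B)": (128_256, 4_096),
    "LLaMA 3 (70B)": (128_256, 8_192),
    "Mistral 7B": (32_000, 4_096),
    "BERT (base)": (30_522, 768),
}

def embedding_params(vocab_size, dim):
    """One row of `dim` numbers per token in the vocab."""
    return vocab_size * dim

for name, (vocab_size, dim) in models.items():
    millions = embedding_params(vocab_size, dim) / 1e6
    print(f"{name:<14} {millions:8.1f}M")
```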

Why bigger models have larger embeddings: More dimensions = more nuance. Like describing a person with 5 adjectives vs. 4,096 adjectives — the more dimensions, the more precisely you can differentiate meanings.


5. Who Prepares the Initial Vocabulary and How?

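The vocabulary is built before training, most commonly with an algorithm called BPE (Byte Pair Encoding). Some models use related algorithms such as WordPiece or Unigram, but the idea is similar. Here's how BPE works: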

Step-by-Step BPE Example:

Code
Training corpus (simplified):
  "low low low low low"
  "lower lower"  
  "newest newest newest"
  "widest"

Step 1: Start with individual characters as vocab
  Vocab: {l, o, w, e, r, n, s, t, d, i, ...}

Step 2: Count which pairs of characters appear most often
  "lo" appears 7 times (5 in "low" + 2 in "lower")
  "ow" appears 7 times
  "ne" appears 3 times
  "we" appears 5 times (2 in "lower" + 3 in "newest")
  ...

Step 3: Merge the most frequent pair ("lo" and "ow" tie at 7; take "lo") → "lo" becomes one token
  Vocab: {l, o, w, e, r, n, s, t, d, i, ..., "lo"}

Step 4: Recount pairs with the merged token
  "lo"+"w" appears 7 times → merge!
  Vocab: {..., "lo", "low"}

Step 5: Keep merging until you reach the desired vocab size
  "low" + "er" → "lower"       (once "e"+"r" has been merged into "er")
  "new" + "est" → "newest"     (once "new" and "est" have been merged)
  ...

Final vocab (after 128,000 merges):
  Single characters: a, b, c, ...
  Common pieces: "ing", "tion", "un", "re", ...
  Full common words: "the", "and", "customer", ...
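The merge loop from the steps above fits in a short script. This is a character-level teaching sketch on the same toy corpus. Production tokenizers typically start from bytes, handle spaces and pre-tokenization, and run many thousands of merges. The function names are made up for this example.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

corpus = {"low": 5, "lower": 2, "newest": 3, "widest": 1}  # word -> count
words = {tuple(word): freq for word, freq in corpus.items()}  # start from characters

merges = []
for _ in range(4):  # a real tokenizer keeps going until the target vocab size
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # on a tie, the pair seen first wins
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
```

The first two merges are ("l", "o") and then ("lo", "w"), matching Steps 3 and 4.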
  

Who does this?

The model creators (Meta, OpenAI, Google) run a tokenizer-training algorithm such as BPE on their training corpus once, before model training starts. It's a preprocessing step.

Code
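Meta (LLaMA 3), simplified:
  1. Collected training text (web, books, code...)
  2. Chose a target vocab size: 128,256 (128,000 BPE tokens + 256 special tokens)
  3. Built the BPE vocab (reportedly ~100K tokens reused from tiktoken plus ~28K added for non-English text)
  4. Output: tokenizer.model file (the vocab)
  5. THEN started training the actual LLM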

6. The Big Question: "English Didn't Change, So Why Do Vocabs Differ?"

Great intuition — but here's why vocabs do change between models:

Reason 1: Different Training Data

Code
LLaMA 2 (trained 2023):
  - Mostly English text, books, web
  - Less code
  - Result: 32,000 tokens, optimized for English

LLaMA 3 (trained 2024):
  - Much more code (Python, JS, etc.)
  - More multilingual (Hindi, Chinese, etc.)
  - Result: 128,256 tokens

Tokens a code-heavy corpus favors:
  "def", "self", "import", "    " (4 spaces = 1 token for indentation!)
  
Tokens a multilingual corpus favors:
  "的", "है", "это" (common Chinese/Hindi/Russian pieces)

Reason 2: Different Vocab Sizes = Different Splits

Code
Small vocab (32,000 tokens):
  "embeddings" → ["embed", "dings"]     (2 tokens)
  "transformer" → ["trans", "former"]    (2 tokens)
  
Large vocab (128,000 tokens):
  "embeddings" → ["embeddings"]          (1 token!)  
  "transformer" → ["transformer"]        (1 token!)
  
Bigger vocab = fewer tokens per sentence = faster processing
But bigger vocab = larger embedding matrix = more memory

Reason 3: Different Optimization Goals

Code
GPT-4 (OpenAI):
  - Optimized for conversation, reasoning
  - Kept common English phrases as single tokens
  - " cannot" = 1 token (note the space is included!)

Code-focused tokenizers (vocab trained mostly on code; note that CodeLlama itself reuses the LLaMA 2 tokenizer):
  - Optimized for programming
  - "function", "return", "async" = single tokens
  - "    " (indentation) = single token

Reason 4: Efficiency Tradeoffs

Code
The tradeoff:
  Smaller Vocab (32K)             Larger Vocab (128K)
  ───────────────────             ───────────────────
  ✅ Less memory                  ✅ Fewer tokens/sentence
  ✅ Smaller embedding table      ✅ Faster inference
  ❌ More tokens per sentence     ❌ Huge embedding table
  ❌ Slower inference             ❌ More memory

Summary Table

Question                            Answer
────────                            ──────
What is an embedding?               A vector of numbers that encodes meaning
How is it learned?                  By observing which words appear in similar contexts across billions of sentences
Names/rare words?                   Split into subword pieces; transformer combines them for meaning
Who makes the vocab?                Model creators, using BPE on training data, before training starts
Does vocab change across models?    Yes: different data, different sizes, different goals
Why change if English is the same?  More languages, more code, efficiency tradeoffs, and different corpus compositions

The analogy:

The English alphabet hasn't changed, but dictionaries differ (Oxford vs. Webster vs. Urban Dictionary). Each is optimized for a different audience. Same idea with LLM vocabularies: same language, different "dictionaries" optimized for different jobs.