A transformer is a neural network architecture that processes all tokens in a sequence in parallel (rather than one at a time, as older recurrent models did) and learns which tokens should attend to which other tokens to work out meaning.
The Analogy
Imagine you're reading this sentence:
"The bank was steep, so I didn't jump into the river"
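When you hit the word "bank," you can't yet tell which sense is meant. Once you reach "river," you settle on "riverbank," not "financial bank." A transformer does something similar, but mathematically: each token draws on the other tokens in the sentence to pin down its meaning.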
The Architecture (Simplified)
Input: "Customer wants to cancel service"
│
▼
┌─────────────────────────┐
│ 1. TOKENIZATION │ → [12043, 6592, 311, 12074, 2532]
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ 2. EMBEDDING LAYER │ → Each ID becomes a 4096-dim vector
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ 3. POSITIONAL ENCODING │ → Adds "I'm the 1st/2nd/3rd word" info
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ 4. TRANSFORMER BLOCKS │ ← This is the magic (repeated 32-96 times)
│ ┌─────────────────┐ │
│ │ Self-Attention │ │ → "Which other words matter for THIS word?"
│ └─────────────────┘ │
│ ┌─────────────────┐ │
│ │ Feed-Forward NN │ │ → "Now process what I learned"
│ └─────────────────┘ │
│ (repeat N times) │
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ 5. OUTPUT HEAD │ → Predicts the next token (probability over vocab)
└─────────────────────────┘
│
▼
Output: "cancellation" (most probable next token)
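The data flow in the diagram can be sketched in a few lines of plain Python. This is a toy under loud assumptions: the six-word vocabulary, the 8-dimensional width, whitespace tokenization, sinusoidal positions, and reusing the embedding table as the output head are all choices made for illustration, and step 4 (the transformer blocks) is skipped entirely. The weights are random, so the predicted token is meaningless; only the shapes and the order of operations matter.

```python
import math
import random

random.seed(0)

# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
vocab = ["customer", "wants", "to", "cancel", "service", "cancellation"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
d_model = 8  # toy width; the diagram uses 4096

# Step 2: embedding table, one (untrained, random) vector per vocabulary entry.
embedding = [[random.gauss(0, 1) for _ in range(d_model)] for _ in vocab]


def positional_encoding(pos, dim):
    """Sinusoidal position vector (one common scheme; others exist)."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Step 1: tokenization (a whitespace split stands in for a subword tokenizer).
ids = [token_to_id[w] for w in "customer wants to cancel service".split()]

# Steps 2-3: look up each embedding and add its position vector.
x = [
    [e + p for e, p in zip(embedding[t], positional_encoding(pos, d_model))]
    for pos, t in enumerate(ids)
]

# Step 4 (the transformer blocks) is intentionally omitted in this sketch.

# Step 5: score the last position against every vocabulary vector, then softmax.
logits = [sum(a * b for a, b in zip(x[-1], row)) for row in embedding]
probs = softmax(logits)
next_id = max(range(len(vocab)), key=lambda i: probs[i])
print("predicted next token:", vocab[next_id])
```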
How Attention Works (The Key Innovation)
Sentence: "The cat sat on the mat because it was tired"
Question the model asks: What does "it" refer to?
Attention weights for "it" (illustrative values that sum to 1):
"The" → 0.02 (low)
"cat" → 0.71 (HIGH — "it" = the cat!)
"sat" → 0.04 (low)
"on" → 0.01 (low)
"the" → 0.01 (low)
"mat" → 0.18 (some attention — could be the mat)
"because" → 0.03 (low)
The model learns to "attend" to the right words to build understanding.
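Here is a minimal sketch of how such weights are computed, using scaled dot-product attention for a single query. The 2-dimensional key vectors and the query for "it" are made-up numbers chosen so that "cat" wins; in a real model they come from learned projection matrices, and the weights will not match the figures listed above.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # Similarity of the query to every key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)  # non-negative, sums to 1
    # The output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output


tokens = ["The", "cat", "sat", "on", "the", "mat", "because"]
# Hypothetical keys, hand-picked for illustration.
keys = [[0.1, 0.0], [3.0, 1.0], [0.5, 0.2], [0.0, 0.1], [0.1, 0.0], [2.0, 0.5], [0.3, 0.1]]
values = keys  # reuse keys as values to keep the toy small
query_it = [2.0, 1.0]  # hypothetical query vector for "it"

weights, _ = attention(query_it, keys, values)
for tok, w in zip(tokens, weights):
    print(f"{tok:>8}: {w:.2f}")
```

With these toy numbers "cat" receives the largest weight and "mat" the second largest, mirroring the pattern described above.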
Multi-Head Attention — Why "Multi"?
One attention head might focus on grammar (subject-verb agreement). Another head might focus on meaning (what does "it" refer to?). Another might focus on position (what's nearby?).
Head 1 (grammar): "it" attends to "cat" (noun agreement)
Head 2 (coreference): "it" attends to "cat" (pronoun resolution)
Head 3 (proximity): "it" attends to "because" (adjacent preceding word)
All heads combined → rich understanding of each token's role
A model with 32 heads runs 32 attention patterns in parallel, then combines them.
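The "split, attend, recombine" idea can be sketched as follows. This is a simplification: the learned query/key/value projections and the final output projection are omitted, so each head simply works on its own slice of the token vector. The function names and the 4-dimensional toy inputs are illustrative.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention; returns the mixed value vector."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(a * b for a, b in zip(query, k)) / scale for k in keys])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]


def split_heads(vec, n_heads):
    """Cut one token vector into n_heads equal slices."""
    size = len(vec) // n_heads
    return [vec[i * size:(i + 1) * size] for i in range(n_heads)]


def multi_head_attention(x, n_heads):
    """Self-attention over token vectors x, one independent pattern per head."""
    sliced = [split_heads(v, n_heads) for v in x]
    out = []
    for t in range(len(x)):
        merged = []
        for h in range(n_heads):
            head_vectors = [sliced[s][h] for s in range(len(x))]
            # Each head attends using only its own slice of every token.
            merged.extend(attend(sliced[t][h], head_vectors, head_vectors))
        out.append(merged)  # concatenating the heads restores the full width
    return out


x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
y = multi_head_attention(x, n_heads=2)
print(y)
```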
Why Repeat Layers?
Approximate division of labor in a 32-layer model (the boundaries are illustrative, not sharp):
Layers 1-4: Low-level patterns (grammar, syntax, word boundaries)
Layers 5-16: Mid-level patterns (phrases, entities, relationships)
Layers 17-32: High-level patterns (reasoning, tone, intent, world knowledge)
Early layers: "cancel" is a verb, "service" is a noun
Middle layers: "cancel service" is a customer action
Deep layers: "This person is unhappy and wants to churn"
Each layer refines the understanding built by the previous layer.
Summary of Each Component (Few Lines)
| Layer | What It Does |
| --- | --- |
| Tokenizer | Splits text into subword pieces, assigns IDs |
| Embedding | Converts IDs into meaning-vectors |
| Positional Encoding | Tells the model word ORDER (since attention has no inherent sequence) |
| Self-Attention | Each token asks "which other tokens are relevant to me?" and pulls info from them |
| Feed-Forward Network | Processes the attention output: adds non-linearity, stores factual knowledge |
| Layer Norm | Stabilizes numbers between layers so training doesn't explode |
| Residual Connections | Adds the input back to the output of each block (prevents "forgetting" earlier info) |
| Output Head (LM Head) | Maps final vectors back to vocab-sized probabilities → picks the next token |
One-liner: A transformer is a stack of attention layers that let every token look at every other token to figure out meaning, repeated dozens of times until the model has a rich enough picture of the input to predict what comes next.
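Three of the components in the summary (layer norm, the feed-forward network, and residual connections) fit in a short sketch. It assumes the "pre-norm" arrangement, where normalization happens before the sub-layer. That is one common layout, not the only one. The learned scale and shift of layer norm are omitted, and the tiny weight matrices are made up.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Rescale a vector to zero mean and unit variance (learned gain/bias omitted)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def feed_forward(vec, w1, w2):
    """Two linear maps with a ReLU in between; each row of w1/w2 is one output unit."""
    hidden = [max(0.0, sum(v * w for v, w in zip(vec, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]


def residual_block(vec, sublayer):
    """Pre-norm residual: output = input + sublayer(layer_norm(input))."""
    return [a + b for a, b in zip(vec, sublayer(layer_norm(vec)))]


# Hypothetical weights: model width 2, hidden width 3.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w2 = [[0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]

x = [2.0, 4.0]
out = residual_block(x, lambda v: feed_forward(v, w1, w2))
print(out)
```

Because the input is added back at the end, the block can only adjust the vector, never replace it outright. That is the sense in which residual connections prevent "forgetting."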
1. What Is an Embedding?
An embedding is a list of numbers that represents the meaning of a token. Think of it as GPS coordinates, but instead of 2 dimensions (latitude, longitude), you have thousands of dimensions.
Real GPS (2 dimensions):
New York → [40.71, -74.00]
New Jersey → [40.05, -74.40] ← nearby = geographically similar
Tokyo → [35.68, 139.69] ← far away
Token Embedding (4096 dimensions):
"cancel" → [0.82, -0.14, 0.91, 0.03, -0.67, ... 4096 numbers]
"terminate" → [0.79, -0.11, 0.88, 0.05, -0.64, ... ] ← nearby = similar meaning
"pizza" → [-0.45, 0.72, -0.33, 0.64, 0.21, ... ] ← far away = unrelated
Each of those 4096 numbers contributes to the meaning. Individual dimensions are usually not interpretable on their own, but collectively they encode things like:
Is this a verb or noun?
Is it positive or negative?
Is it formal or casual?
Is it about food? technology? emotions?
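"Nearby" is usually measured with cosine similarity. The sketch below applies it to the first five numbers of each example vector above. The remaining dimensions are elided in the text, so this is only an illustration of the calculation, not a real model's similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0 = unrelated, -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# First five dimensions of the example vectors (the rest are elided above).
cancel = [0.82, -0.14, 0.91, 0.03, -0.67]
terminate = [0.79, -0.11, 0.88, 0.05, -0.64]
pizza = [-0.45, 0.72, -0.33, 0.64, 0.21]

sim_close = cosine(cancel, terminate)
sim_far = cosine(cancel, pizza)
print(f"cancel vs terminate: {sim_close:.3f}")
print(f"cancel vs pizza:     {sim_far:.3f}")
```

On these five numbers, "cancel" and "terminate" score very close to 1, while "cancel" and "pizza" come out negative.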
2. How Are Embeddings Learned?
Start: Completely Random
Before training (Day 0):
"king" → [0.52, 0.11, -0.83, 0.44, ...] ← random garbage
"queen" → [-0.31, 0.67, 0.02, -0.91, ...] ← random garbage
"pizza" → [0.19, -0.55, 0.73, 0.28, ...] ← random garbage
Distances between words are arbitrary at this point, so the vectors carry no meaning yet.
Training: Learn from Context
The model reads billions of sentences and learns:
Sentence 1: "The king ruled the kingdom"
Sentence 2: "The queen ruled the kingdom"
Sentence 3: "I ordered a pizza"
Model notices:
- "king" and "queen" appear in SAME contexts
(before "ruled", after "the", with "kingdom")
- "pizza" appears in COMPLETELY DIFFERENT contexts
(after "ordered", with "toppings", "delivery")
So training pushes:
- "king" vector TOWARD "queen" vector (similar contexts)
- "pizza" vector AWAY from both (different contexts)
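The "same contexts" observation can be made concrete by counting neighbors. The sketch below is a stand-in, not how an LLM trains: real models adjust embeddings by gradient descent on next-token prediction, whereas this simply builds a vector of co-occurring words for each target and compares them. It uses the three example sentences from above.

```python
import math
from collections import Counter

sentences = [
    "The king ruled the kingdom",
    "The queen ruled the kingdom",
    "I ordered a pizza",
]


def context_counts(target):
    """Count every other word that shares a sentence with the target word."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        if target in words:
            counts.update(w for w in words if w != target)
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)  # a Counter returns 0 for missing keys
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


king, queen, pizza = (context_counts(w) for w in ("king", "queen", "pizza"))
print("king vs queen:", round(cosine(king, queen), 3))
print("king vs pizza:", round(cosine(king, pizza), 3))
```

In this tiny corpus "king" and "queen" have identical neighbors, so their similarity is 1, while "king" and "pizza" share no neighbors at all, so theirs is 0.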
After Billions of Sentences:
After training:
"king" → [0.83, -0.21, 0.67, 0.45, ...]
"queen" → [0.81, -0.18, 0.64, 0.42, ...] ← very close!
"prince" → [0.78, -0.25, 0.61, 0.39, ...] ← also close!
"pizza" → [-0.44, 0.71, -0.28, 0.63, ...] ← far away
The famous example (from word2vec-style embeddings):
king - man + woman ≈ queen (the vector arithmetic holds approximately)
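A toy version of that arithmetic, assuming hand-built 2-dimensional vectors where one axis loosely means "royalty" and the other "gender." These vectors are invented for the demonstration. Learned embeddings have no such clean axes, and the analogy only holds approximately there.

```python
import math

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.9, 0.8],
    "pizza": [-1.0, 0.1],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman")
print(result)
```

Excluding the three input words from the candidates is standard practice for this test. Without it, the nearest vector is often one of the inputs.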
3. How Does It Store Names, Proper Nouns, Standard Words?
Standard Words (common):
"the" → Single token, ID 279
Embedding learned from billions of occurrences
Vector captures: "article, neutral, precedes nouns"
"customer" → Single token, ID 12043
Vector captures: "person, business context, receives service"
Names and Rare Words (get split):
"Comcast" → ["Com", "cast"] → IDs [1568, 4384]
Each piece gets its OWN embedding vector
The transformer layers COMBINE them to understand
"this refers to the company Comcast"
"Teradata" → ["Ter", "adata"] → IDs [7321, 18294]
"John" → ["John"] → ID [2782] (common name = single token)
"Kuznetsov" → ["K", "uz", "nets", "ov"] → IDs [42, 3712, 17843, 869]
Key insight: For split words, the meaning isn't in any single token's embedding — it emerges when the transformer layers process them together with attention.
"Com" alone could mean: company, combination, communication, comedy...
"cast" alone could mean: throw, actors, plaster cast...
But "Com" + "cast" in context → the model figures out "it's the telecom company"
(This is the transformer's job, not the embedding's job alone)
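To see how a rare word ends up as several pieces, here is a simplified splitter that greedily takes the longest vocabulary piece at each position. This is closer to WordPiece-style matching than to how BPE applies its merges, and the nine-entry vocabulary is hypothetical, chosen to reproduce the splits shown above.

```python
def greedy_split(word, vocab):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical toy vocabulary; a real one has tens of thousands of pieces.
toy_vocab = {"Com", "cast", "Ter", "adata", "John", "K", "uz", "nets", "ov"}

for name in ["Comcast", "Teradata", "John", "Kuznetsov"]:
    print(name, "->", greedy_split(name, toy_vocab))
```

Each returned piece would then be looked up in the embedding table separately, and the attention layers are what tie "Com" and "cast" back together.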
4. Embedding Sizes Across Models
| Model | Vocab Size | Embedding Dimensions | Embedding-Layer Parameters (vocab × dimensions) |
| --- | --- | --- | --- |
| GPT-2 | 50,257 | 768 | ~38 million |
| GPT-3 | 50,257 | 12,288 | ~617 million |
| GPT-4 | ~100,000 | ~8,192–12,288 (estimated) | ~800M–1.2B |
| LLaMA 2 | 32,000 | 4,096 | ~131 million |
| LLaMA 3 | 128,256 | 4,096 (8B) / 8,192 (70B) | ~525M / ~1B |
| Mistral 7B | 32,000 | 4,096 | ~131 million |
| BERT (base) | 30,522 | 768 | ~23 million |
Why bigger models have larger embeddings: More dimensions = more nuance. Like describing a person with 5 adjectives vs. 4,096 adjectives — the more dimensions, the more precisely you can differentiate meanings.
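The parameter counts for the embedding layer are just vocabulary size times embedding width, since the layer is one lookup table with a row per token. A quick check for the models with published figures (GPT-4 is left out because its dimensions are only estimated):

```python
# (vocab size, embedding dimensions) as listed in this section.
models = {
    "GPT-2": (50_257, 768),
    "GPT-3": (50_257, 12_288),
    "LLaMA 2": (32_000, 4_096),
    "LLaMA 3 (8B)": (128_256, 4_096),
    "LLaMA 3 (70B)": (128_256, 8_192),
    "Mistral 7B": (32_000, 4_096),
    "BERT (base)": (30_522, 768),
}


def embedding_params(vocab_size, dim):
    """One row of `dim` numbers per vocabulary entry."""
    return vocab_size * dim


for name, (vocab_size, dim) in models.items():
    millions = embedding_params(vocab_size, dim) / 1e6
    print(f"{name:<14} {millions:8.1f}M parameters")
```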
5. Who Prepares the Initial Vocabulary and How?
The vocabulary is built before training using a subword tokenization algorithm, most commonly BPE (Byte Pair Encoding). Here's how:
Step-by-Step BPE Example:
Training corpus (simplified):
"low low low low low"
"lower lower"
"newest newest newest"
"widest"
Step 1: Start with individual characters as vocab
Vocab: {l, o, w, e, r, n, s, t, d, i, ...}
Step 2: Count which pairs of characters appear most often
"lo" appears 7 times (5 in "low" + 2 in "lower")
"ow" appears 7 times
"we" appears 5 times (2 in "lower" + 3 in "newest")
"ne" appears 3 times
...
Step 3: Merge the most frequent pair → "lo" becomes one token
Vocab: {l, o, w, e, r, n, s, t, d, i, ..., "lo"}
Step 4: Recount pairs with the merged token
"lo"+"w" appears 7 times → merge!
Vocab: {..., "lo", "low"}
Step 5: Keep merging until you reach desired vocab size
"low" + "er" → "lower"
"new" + "est" → "newest"
...
Final vocab (after 128,000 merges):
Single characters: a, b, c, ...
Common pieces: "ing", "tion", "un", "re", ...
Full common words: "the", "and", "customer", ...
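The merge loop from the walkthrough can be sketched directly. This is a bare-bones version run on the same toy corpus: it counts adjacent pairs weighted by word frequency, merges the most frequent pair, and repeats. Ties are broken by whichever pair was seen first, which is an arbitrary choice of this sketch. Production tokenizers add byte-level handling, pre-tokenization rules, and much faster data structures.

```python
from collections import Counter


def pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def train_bpe(corpus_counts, num_merges):
    """Return the list of merges learned from a {word: frequency} corpus."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties: first pair encountered wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges


corpus = {"low": 5, "lower": 2, "newest": 3, "widest": 1}
merges = train_bpe(corpus, num_merges=4)
print(merges)
```

The first two merges come out as "l"+"o" and then "lo"+"w", matching Steps 3 and 4 above. After that, this corpus favors "e"+"s" and "es"+"t", which is how a suffix like "est" becomes a token.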
Who does this?
The model creators (Meta, OpenAI, Google) train the tokenizer on their corpus, or a sample of it, once before model training starts. It's a preprocessing step.
Meta (LLaMA 3):
1. Collected training text (web, books, code...)
2. Ran BPE algorithm on that text
3. Chose 128,256 as target vocab size
4. Output: tokenizer.model file (the vocab)
5. THEN started training the actual LLM
6. The Big Question: "English Didn't Change, So Why Do Vocabs Differ?"
Great intuition — but here's why vocabs do change between models:
Reason 1: Different Training Data
LLaMA 2 (trained 2023):
- Mostly English text, books, web
- Less code
- Result: 32,000 tokens, optimized for English
LLaMA 3 (trained 2024):
- Much more code (Python, JS, etc.)
- More multilingual (Hindi, Chinese, etc.)
- Result: 128,256 tokens
New tokens added for code:
"def", "self", "import", "    " (4 spaces = 1 token for indentation!)
New tokens added for other languages:
"的", "है", "это" (common Chinese/Hindi/Russian pieces)
Reason 2: Different Vocab Sizes = Different Splits
Small vocab (32,000 tokens):
"embeddings" → ["embed", "dings"] (2 tokens)
"transformer" → ["trans", "former"] (2 tokens)
Large vocab (128,000 tokens):
"embeddings" → ["embeddings"] (1 token!)
"transformer" → ["transformer"] (1 token!)
Bigger vocab = fewer tokens per sentence = faster processing
But bigger vocab = larger embedding matrix = more memory
Reason 3: Different Optimization Goals
GPT-4 (OpenAI):
- Optimized for conversation, reasoning
- Kept common English phrases as single tokens
- " cannot" = 1 token (note the space is included!)
Code-specific models (CodeLlama):
- Optimized for programming
- "function", "return", "async" = single tokens
- "    " (indentation) = single token
| Question | Answer |
| --- | --- |
| How are embeddings learned? | By observing which words appear in similar contexts across billions of sentences |
| Names/rare words? | Split into subword pieces; transformer combines them for meaning |
| Who makes the vocab? | Model creators, using BPE on training data, before training starts |
| Does vocab change across models? | Yes: different data, different sizes, different goals |
| Why change if English is the same? | More languages, more code, efficiency tradeoffs, and different corpus compositions |
The analogy:
English alphabet hasn't changed — but dictionaries differ (Oxford vs. Webster vs. Urban Dictionary). Each is optimized for different audiences. Same idea with LLM vocabularies — same language, different "dictionaries" optimized for different jobs.