Thursday, May 28, 2026

LLM quantization explanation

 

Simplest Possible Explanation of 4-bit Quantization

You Download a Model File (FP16)

The model has weights. Let's say one small block of 8 weights (using 8 instead of 64 for simplicity):

Code
Original FP16 weights (2 bytes each):
  [0.8, 0.3, -0.5, 1.2, -1.0, 0.1, 0.6, -0.9]

Memory: 8 weights × 2 bytes = 16 bytes
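To check that byte count yourself, here is a minimal sketch using only Python's standard library. It assumes the block is stored as raw little-endian FP16 values; the variable names are just for illustration.

```python
import struct

weights = [0.8, 0.3, -0.5, 1.2, -1.0, 0.1, 0.6, -0.9]

# "e" is the struct format code for IEEE 754 half precision (FP16),
# so "<8e" packs eight little-endian FP16 values.
raw = struct.pack("<8e", *weights)

print(len(raw))  # 16
```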

"Loading in 4-bit" = Compressing on the Fly

When you pass load_in_4bit=True, each block of weights is compressed like this as it is loaded:

Step 1: Find the largest absolute value in the block

Code
absmax = 1.2 (the largest absolute value)

Step 2: Divide everything by absmax → normalize to [-1, 1]

Code
[0.8/1.2, 0.3/1.2, -0.5/1.2, 1.2/1.2, -1.0/1.2, 0.1/1.2, 0.6/1.2, -0.9/1.2]
= [0.67,   0.25,    -0.42,    1.0,     -0.83,    0.08,    0.5,     -0.75]
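Steps 1 and 2 can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic above, not how any particular library implements it; a real implementation would also need to guard against a block whose absmax is 0.

```python
weights = [0.8, 0.3, -0.5, 1.2, -1.0, 0.1, 0.6, -0.9]

# Step 1: the scale is the largest absolute value in the block.
absmax = max(abs(w) for w in weights)

# Step 2: dividing by absmax maps every weight into [-1, 1].
normalized = [w / absmax for w in weights]

print(absmax)                             # 1.2
print([round(x, 2) for x in normalized])  # [0.67, 0.25, -0.42, 1.0, -0.83, 0.08, 0.5, -0.75]
```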

Step 3: Snap each value to the nearest of 16 allowed levels

Code
4 bits = 16 possible values. Think of it as 16 "bins" (rounded here for
readability; the real NF4 table uses 16 unevenly spaced levels in [-1, 1]):

-1.0  -0.7  -0.5  -0.4  -0.3  -0.2  -0.1   0.0   0.1   0.2   0.3   0.4   0.5   0.7  0.85   1.0
  0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15

 0.67 → snaps to  0.7 (index 13)
 0.25 → snaps to  0.3 (index 10)
-0.42 → snaps to -0.4 (index 3)
 1.0  → snaps to  1.0 (index 15)
-0.83 → snaps to -0.7 (index 1)
 0.08 → snaps to  0.1 (index 8)
 0.5  → snaps to  0.5 (index 12)
-0.75 → snaps to -0.7 (index 1)
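The snapping step is a nearest-neighbour search over the level table. The sketch below uses the rounded illustrative levels from this post rather than real NF4 values, and two things in it are assumptions of mine: the 0.85 entry at index 14, which fills the table out to 16 levels, and the tie-break rule, since 0.25 sits exactly halfway between 0.2 and 0.3.

```python
# 16 illustrative levels (not the real NF4 values). The 0.85 entry at
# index 14 is a filler assumption so that 1.0 lands at index 15.
LEVELS = [-1.0, -0.7, -0.5, -0.4, -0.3, -0.2, -0.1, 0.0,
          0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.85, 1.0]

def snap(x):
    """Return the index of the level closest to x.

    Ties go to the higher index, which matches the worked
    example above (0.25 snaps to 0.3, not 0.2).
    """
    return min(range(len(LEVELS)), key=lambda i: (abs(LEVELS[i] - x), -i))

weights = [0.8, 0.3, -0.5, 1.2, -1.0, 0.1, 0.6, -0.9]
absmax = max(abs(w) for w in weights)
indices = [snap(w / absmax) for w in weights]

print(indices)  # [13, 10, 3, 15, 1, 8, 12, 1]
```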

Step 4: Store

Code
Stored in memory:
  Indices:  [13, 10, 3, 15, 1, 8, 12, 1]  ← 8 × 4 bits = 4 bytes
  Scale:    1.2                             ← 1 × 2 bytes (FP16)
                                            ─────────────
                                            Total: 6 bytes (was 16!)
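Eight 4-bit indices fit in 4 bytes because two indices share each byte. Here is a rough sketch of that packing; putting the first index of each pair in the high nibble is my assumption, and real libraries may lay the bits out differently.

```python
import struct

indices = [13, 10, 3, 15, 1, 8, 12, 1]
absmax = 1.2

def pack_nibbles(idx):
    """Pack pairs of 4-bit indices into bytes, first index in the high nibble."""
    return bytes((idx[i] << 4) | idx[i + 1] for i in range(0, len(idx), 2))

def unpack_nibbles(data):
    """Inverse of pack_nibbles: split each byte back into two indices."""
    out = []
    for b in data:
        out.extend((b >> 4, b & 0x0F))
    return out

packed = pack_nibbles(indices)     # 4 bytes of indices
scale = struct.pack("<e", absmax)  # 2 bytes: the FP16 scale
total = len(packed) + len(scale)

print(packed.hex(), total)  # da3f18c1 6
```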

When the Model Runs (Forward Pass), Decompress

Code
Retrieve index → look up NF4 level → multiply by scale:

  index 13 → 0.7  × 1.2 = 0.84    (original was 0.8)
  index 10 → 0.3  × 1.2 = 0.36    (original was 0.3)
  index 3  → -0.4 × 1.2 = -0.48   (original was -0.5)
  index 15 → 1.0  × 1.2 = 1.2     (original was 1.2) ✓ exact!
  index 1  → -0.7 × 1.2 = -0.84   (original was -1.0) ← worst error
  index 8  → 0.1  × 1.2 = 0.12    (original was 0.1)
  index 12 → 0.5  × 1.2 = 0.6     (original was 0.6) ✓ exact!
  index 1  → -0.7 × 1.2 = -0.84   (original was -0.9)
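Decompression is one table lookup and one multiply per weight. The sketch below reuses the illustrative level table from this post (the 0.85 entry is a filler assumption and is never hit here) and also reports the worst reconstruction error.

```python
# Same illustrative table as in Step 3; 0.85 is a filler assumption.
LEVELS = [-1.0, -0.7, -0.5, -0.4, -0.3, -0.2, -0.1, 0.0,
          0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.85, 1.0]

original = [0.8, 0.3, -0.5, 1.2, -1.0, 0.1, 0.6, -0.9]
indices = [13, 10, 3, 15, 1, 8, 12, 1]
scale = 1.2

# Look up each index, then undo the normalization.
restored = [LEVELS[i] * scale for i in indices]
errors = [r - o for r, o in zip(restored, original)]
worst = max(range(len(errors)), key=lambda i: abs(errors[i]))

print([round(v, 2) for v in restored])  # [0.84, 0.36, -0.48, 1.2, -0.84, 0.12, 0.6, -0.84]
print(worst, round(errors[worst], 2))   # 4 0.16
```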

Side by Side

Code
Original FP16:    [0.80,  0.30, -0.50,  1.20, -1.00,  0.10,  0.60, -0.90]
After 4-bit:      [0.84,  0.36, -0.48,  1.20, -0.84,  0.12,  0.60, -0.84]
Error:            [+0.04,+0.06,+0.02,  0.00, +0.16, +0.02,  0.00, +0.06]

Memory: 16 bytes → 6 bytes (2.7× smaller). With a real 64-weight block and the same 2-byte scale: 128 bytes → 34 bytes (about 3.8× smaller)
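The ratio depends on block size, because the per-block scale is a fixed overhead. A small sketch of the arithmetic, assuming 4 bits per weight and one 2-byte FP16 scale per block as in this post (the function name is made up for illustration):

```python
def compressed_bytes(block_size, scale_bytes=2):
    """Bytes for one block: 4 bits per weight plus one scale value."""
    return block_size * 4 // 8 + scale_bytes

for n in (8, 64):
    fp16_bytes = n * 2
    q_bytes = compressed_bytes(n)
    print(n, fp16_bytes, q_bytes, round(fp16_bytes / q_bytes, 2))
# 8 16 6 2.67
# 64 128 34 3.76
```

Larger blocks amortize the scale better, but a single outlier then stretches the scale for more weights.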

That's It. The Whole Process Is:

Code
┌─────────────────────────────────────────────┐
│                                             │
│  LOADING:                                   │
│    FP16 weights                             │
│        ↓  ÷ absmax                          │
│    Normalized [-1, 1]                       │
│        ↓  snap to nearest bin               │
│    4-bit index + scale stored in GPU memory │
│                                             │
│  RUNNING (every forward pass):              │
│    4-bit index                              │
│        ↓  look up bin value                 │
│    NF4 level                                │
│        ↓  × scale                           │
│    Approximate FP16 value → used in matmul  │
│                                             │
└─────────────────────────────────────────────┘

Why It Works

Code
Original weights:  [0.80,  0.30, -0.50,  1.20, -1.00,  0.10,  0.60, -0.90]
                    ↓ matrix multiply with input
Original output:   42.5   (illustrative number)

4-bit weights:     [0.84,  0.36, -0.48,  1.20, -0.84,  0.12,  0.60, -0.84]
                    ↓ matrix multiply with same input
4-bit output:      42.3   (illustrative number)

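Close enough! A shift this small usually leaves the ranking of next-token scores unchanged, so the model typically still picks the same next token.

The small errors in individual weights tend to partly cancel across the thousands of multiply-adds behind each output, rather than pile up.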
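To see the effect on a single output, you can compare the two dot products directly. The input vector x below is made up for illustration (the post doesn't give one), so the printed numbers are specific to this toy case.

```python
original = [0.80, 0.30, -0.50, 1.20, -1.00, 0.10, 0.60, -0.90]
quantized = [0.84, 0.36, -0.48, 1.20, -0.84, 0.12, 0.60, -0.84]

# Hypothetical input activations, chosen only for this example.
x = [1.0, -2.0, 0.5, 3.0, -1.5, 2.0, -0.5, 1.0]

def dot(w, v):
    """Plain dot product: one output element of a matrix multiply."""
    return sum(a * b for a, b in zip(w, v))

y_fp16 = dot(original, x)
y_4bit = dot(quantized, x)

print(round(y_fp16, 2), round(y_4bit, 2))  # 4.05 3.84
```

With only 8 weights the two outputs differ by about 0.2. The cancellation argument is about the much longer rows in a real layer, which a toy block like this can't demonstrate.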
