Simplest Possible Explanation of 4-bit Quantization

You Download a Model File (FP16)

The model is made of weights. Let's look at one small block of 8 weights (real blocks typically hold 64; 8 keeps the example simple):

Original FP16 weights (2 bytes each):
  [0.8, 0.3, -0.5, 1.2, -1.0, 0.1, 0.6, -0.9]

Memory: 8 weights × 2 bytes = 16 bytes

"Loading in 4-bit" = Compressing on the Fly

When you pass load_in_4bit=True (for example, to Hugging Face Transformers with bitsandbytes), each block of weights goes through these steps:

Step 1: Find the largest absolute value in the block (absmax)

absmax = 1.2 (the largest absolute value)

Step 2: Divide everything by absmax → normalize to [-1, 1]

[0.8/1.2, 0.3/1.2, -0.5/1.2, 1.2/1.2, -1.0/1.2, 0.1/1.2, 0.6/1.2, -0.9/1.2]
= [0.67,   0.25,    -0.42,    1.0,     -0.83,    0.08,    0.5,     -0.75]
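Steps 1 and 2 fit in a few lines of plain Python. This is only a minimal sketch on the example block, not how any particular library implements it; the names `weights`, `absmax`, and `normalized` are chosen here for illustration, and ordinary Python floats stand in for FP16 values.

```python
# Example block from above (Python floats stand in for FP16 values).
weights = [0.8, 0.3, -0.5, 1.2, -1.0, 0.1, 0.6, -0.9]

# Step 1: the scale is the largest absolute value in the block.
absmax = max(abs(w) for w in weights)

# Step 2: dividing by absmax maps every weight into [-1, 1].
normalized = [w / absmax for w in weights]

print(absmax)                             # 1.2
print([round(x, 2) for x in normalized])  # [0.67, 0.25, -0.42, 1.0, -0.83, 0.08, 0.5, -0.75]
```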

Step 3: Snap each to nearest of 16 allowed levels

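4 bits = 16 possible values. Think of them as 16 "bins" (a simplified stand-in for the real NF4 levels):

Level:  -1.0  -0.7  -0.5  -0.4  -0.3  -0.2  -0.1   0.0   0.1   0.2   0.3   0.4   0.5   0.7   ...   1.0
Index:     0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15

(Index 14, the level between 0.7 and 1.0, is not listed; no weight in this example lands on it.)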

 0.67 → snaps to  0.7 (index 13)
 0.25 → snaps to  0.3 (index 10; a tie between 0.2 and 0.3, rounded up here)
-0.42 → snaps to -0.4 (index 3)
 1.0  → snaps to  1.0 (index 15)
-0.83 → snaps to -0.7 (index 1)
 0.08 → snaps to  0.1 (index 8)
 0.5  → snaps to  0.5 (index 12)
-0.75 → snaps to -0.7 (index 1)
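The snapping step is a nearest-neighbor search over the level table. The sketch below uses the simplified levels from this post, not the real NF4 table. The table above gives 15 level values for 16 indices, and the indices used in the worked example imply the missing one is index 14, so `0.85` is a hypothetical placeholder. The helper name `snap` and the rule that ties go to the higher index are also assumptions, chosen so the result matches the hand-worked example.

```python
# Simplified levels from the table above. Index 14 has no listed value,
# so 0.85 is a hypothetical placeholder (an assumption of this sketch).
LEVELS = [-1.0, -0.7, -0.5, -0.4, -0.3, -0.2, -0.1, 0.0,
          0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.85, 1.0]

def snap(x):
    """Return the index of the level nearest to x (ties go to the higher index)."""
    return min(range(len(LEVELS)), key=lambda i: (abs(x - LEVELS[i]), -i))

weights = [0.8, 0.3, -0.5, 1.2, -1.0, 0.1, 0.6, -0.9]
absmax = max(abs(w) for w in weights)
indices = [snap(w / absmax) for w in weights]

print(indices)  # [13, 10, 3, 15, 1, 8, 12, 1]
```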

Step 4: Store

Stored in memory:
  Indices:  [13, 10, 3, 15, 1, 8, 12, 1]  ← 8 × 4 bits = 4 bytes
  Scale:    1.2                             ← 1 × 2 bytes (FP16)
                                            ─────────────
                                            Total: 6 bytes (was 16!)
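To see where the 4 bytes come from: two 4-bit indices fit in one byte. Here is one possible way to pack them, as a sketch. Putting the first index in the high half of the byte is an arbitrary choice made for this example, and real libraries may lay the bits out differently.

```python
indices = [13, 10, 3, 15, 1, 8, 12, 1]

# Pack two 4-bit indices into each byte: first index in the high 4 bits,
# second index in the low 4 bits (an arbitrary order for this sketch).
packed = bytes((hi << 4) | lo for hi, lo in zip(indices[0::2], indices[1::2]))

scale_bytes = 2  # one FP16 scale per block

print(packed.hex())               # da3f18c1
print(len(packed) + scale_bytes)  # 6

# Unpacking reverses the shift and mask.
unpacked = [n for b in packed for n in (b >> 4, b & 0x0F)]
print(unpacked == indices)        # True
```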

When the Model Runs (Forward Pass), Decompress

Retrieve index → look up NF4 level → multiply by scale:

  index 13 → 0.7  × 1.2 = 0.84    (original was 0.8)
  index 10 → 0.3  × 1.2 = 0.36    (original was 0.3)
  index 3  → -0.4 × 1.2 = -0.48   (original was -0.5)
  index 15 → 1.0  × 1.2 = 1.2     (original was 1.2) ✓ exact!
  index 1  → -0.7 × 1.2 = -0.84   (original was -1.0) ← worst error
  index 8  → 0.1  × 1.2 = 0.12    (original was 0.1)
  index 12 → 0.5  × 1.2 = 0.6     (original was 0.6) ✓ exact!
  index 1  → -0.7 × 1.2 = -0.84   (original was -0.9)
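Decompression is just a table lookup followed by a multiply. A minimal sketch, again using this post's simplified levels (with `0.85` as a hypothetical placeholder for the unlisted index 14) rather than the real NF4 table:

```python
# Simplified levels from Step 3; index 14 (0.85) is a hypothetical placeholder.
LEVELS = [-1.0, -0.7, -0.5, -0.4, -0.3, -0.2, -0.1, 0.0,
          0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.85, 1.0]

indices = [13, 10, 3, 15, 1, 8, 12, 1]  # what was stored
scale = 1.2                             # the block's absmax

# Look up each level, then multiply by the scale.
restored = [LEVELS[i] * scale for i in indices]

print([round(w, 2) for w in restored])
# [0.84, 0.36, -0.48, 1.2, -0.84, 0.12, 0.6, -0.84]
```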

Side by Side

Original FP16:    [ 0.80,  0.30, -0.50,  1.20, -1.00,  0.10,  0.60, -0.90]
After 4-bit:      [ 0.84,  0.36, -0.48,  1.20, -0.84,  0.12,  0.60, -0.84]
Error:            [+0.04, +0.06, +0.02,  0.00, +0.16, +0.02,  0.00, +0.06]

Memory: 16 bytes → 6 bytes (about 2.7× smaller; with a real 64-weight block, 128 bytes → 34 bytes, about 3.8× smaller)
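The compression ratio depends on the block size, because the scale is a fixed cost per block. A quick calculation, assuming 4 bits per weight plus one 2-byte FP16 scale per block as in this post (the helper `block_bytes` is a name made up for this sketch):

```python
def block_bytes(n_weights, scale_bytes=2):
    """Bytes for one quantized block: 4 bits per weight plus one scale."""
    return n_weights * 4 // 8 + scale_bytes

for n in (8, 64):
    fp16 = n * 2          # 2 bytes per FP16 weight
    quant = block_bytes(n)
    print(n, fp16, quant, round(fp16 / quant, 2))
# 8 16 6 2.67
# 64 128 34 3.76
```

The bigger the block, the less the scale matters, which is why the 64-weight block gets closer to the ideal 4×.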

That's It. The Whole Process Is:

┌─────────────────────────────────────────────┐
│                                             │
│  LOADING:                                   │
│    FP16 weights                             │
│        ↓  ÷ absmax                          │
│    Normalized [-1, 1]                       │
│        ↓  snap to nearest bin               │
│    4-bit index + scale stored in GPU        │
│                                             │
│  RUNNING (every forward pass):              │
│    4-bit index                              │
│        ↓  look up bin value                 │
│    NF4 level                                │
│        ↓  × scale                           │
│    Approximate FP16 value → used in matmul  │
│                                             │
└─────────────────────────────────────────────┘
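Putting both halves of the diagram together, here is a rough end-to-end sketch that handles more than one block. It is an illustration under this post's simplifications, not a real library's implementation: the level table is the simplified one (with `0.85` as a hypothetical placeholder for index 14), `quantize` and `dequantize` are names made up here, ties go to the higher index, and the second block of small weights is invented to show why each block gets its own scale.

```python
# Simplified levels from Step 3; index 14 (0.85) is a hypothetical placeholder.
LEVELS = [-1.0, -0.7, -0.5, -0.4, -0.3, -0.2, -0.1, 0.0,
          0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.85, 1.0]

def quantize(weights, block_size=8):
    """LOADING: split into blocks, return one (indices, scale) pair per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) or 1.0  # all-zero block: avoid /0
        idx = [min(range(16), key=lambda i: (abs(w / scale - LEVELS[i]), -i))
               for w in block]
        blocks.append((idx, scale))
    return blocks

def dequantize(blocks):
    """RUNNING: look up each level and multiply by its block's scale."""
    return [LEVELS[i] * scale for idx, scale in blocks for i in idx]

big = [0.8, 0.3, -0.5, 1.2, -1.0, 0.1, 0.6, -0.9]            # block from this post
small = [0.02, -0.01, 0.03, 0.0, -0.04, 0.01, 0.02, -0.03]   # invented second block

blocks = quantize(big + small)
restored = dequantize(blocks)

print(blocks[0])     # ([13, 10, 3, 15, 1, 8, 12, 1], 1.2)
print(blocks[1][1])  # 0.04
```

Because the second block has its own scale of 0.04, its tiny weights are not crushed by the 1.2 from the first block.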

Why It Works

Original weights:  [0.80,  0.30, -0.50,  1.20, -1.00,  0.10,  0.60, -0.90]
                    ↓ matrix multiply with input
Original output:   42.5

4-bit weights:     [0.84,  0.36, -0.48,  1.20, -0.84,  0.12,  0.60, -0.84]
                    ↓ matrix multiply with same input
4-bit output:      42.3

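Close enough! (The two outputs above are illustrative numbers.) In most cases the model still picks the same next token.

The small errors in individual weights tend to partly cancel out across the thousands of multiplications in each matrix multiply, so the output shifts only slightly.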
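You can try the comparison yourself with a dot product. The input vector `x` below is made up for this sketch, so the outputs will not match the 42.5 and 42.3 shown above; the point is only how little the result moves.

```python
original = [0.80, 0.30, -0.50, 1.20, -1.00, 0.10, 0.60, -0.90]
four_bit = [0.84, 0.36, -0.48, 1.20, -0.84, 0.12, 0.60, -0.84]

# Hypothetical input activations, invented for this sketch.
x = [1.0, -2.0, 0.5, 3.0, -1.5, 2.0, 0.0, 1.0]

def dot(w, v):
    """One row of a matrix multiply: sum of weight * input."""
    return sum(a * b for a, b in zip(w, v))

out_original = dot(original, x)
out_four_bit = dot(four_bit, x)

print(round(out_original, 2), round(out_four_bit, 2))  # 4.35 4.14
```

Every weight error in this block happens to be positive, but the inputs have mixed signs, so the individual contributions partly cancel and the output moves by only about 0.21.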

Bigdata and data science by Kartheek Dachepalli

Thursday, May 28, 2026

LLM quantization explanation
