┌─────────────────────────────────────────────┐
│                                             │
│  LOADING:                                   │
│    FP16 weights                             │
│      ↓ ÷ absmax                             │
│    Normalized [-1, 1]                       │
│      ↓ snap to nearest bin                  │
│    4-bit index + scale stored in GPU        │
│                                             │
│  RUNNING (every forward pass):              │
│    4-bit index                              │
│      ↓ look up bin value                    │
│    NF4 level                                │
│      ↓ × scale                              │
│    Approximate FP16 value → used in matmul  │
│                                             │
└─────────────────────────────────────────────┘
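The two halves of the diagram can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it: the function names `quantize_block` and `dequantize_block` are made up for this example, the sample weights are arbitrary, and the 16 NF4 levels are the published codebook values rounded to four decimals. Real implementations pack two 4-bit indices per byte and run on the GPU.

```python
# NF4 codebook: 16 levels in [-1, 1], rounded to 4 decimals (approximate values).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(weights):
    """LOADING: turn one block of weights into (4-bit indices, scale)."""
    scale = max(abs(w) for w in weights)  # absmax of the block
    if scale == 0.0:
        # All-zero block: every weight maps to the 0.0 level (index 7).
        return [7] * len(weights), 0.0
    indices = []
    for w in weights:
        normalized = w / scale  # now in [-1, 1]
        # Snap to the nearest bin and keep only its index (0..15).
        idx = min(range(16), key=lambda i: abs(NF4_LEVELS[i] - normalized))
        indices.append(idx)
    return indices, scale


def dequantize_block(indices, scale):
    """RUNNING: look up each bin value and multiply by the stored scale."""
    return [NF4_LEVELS[i] * scale for i in indices]


# Hypothetical sample block, chosen only for illustration.
weights = [0.42, -0.17, 0.05, -0.63, 0.30, 0.00]
indices, scale = quantize_block(weights)
approx = dequantize_block(indices, scale)

for w, i, a in zip(weights, indices, approx):
    print(f"{w:+.2f} -> index {i:2d} -> {a:+.4f}")
```

Only `indices` and `scale` need to be kept after loading; the approximate weights are rebuilt on the fly each time the layer runs.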
Why It Works
Original weights: [0.80, 0.30, -0.50, 1.20, -1.00, 0.10, 0.60, -0.90]
↓ matrix multiply with input
Original output: 42.5
4-bit weights: [0.84, 0.36, -0.48, 1.20, -0.84, 0.12, 0.60, -0.84]
↓ matrix multiply with same input
4-bit output: 42.3
Close enough! In most cases the model still picks the same next token.
Each weight's rounding error is small and can point in either direction, so the errors largely cancel when thousands of products are summed in a matrix multiply.
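The cancellation effect can be checked with a small experiment. The sketch below rests on several assumptions: the weights and inputs are random Gaussian values rather than real model data, the block size of 64 is an assumed setting, the helper `nf4_round_trip` is a name invented for this example, and the NF4 levels are rounded to four decimals. It compares the actual error in a dot product against the error you would get if every rounding error pushed in the same direction.

```python
import random

# NF4 codebook: 16 levels in [-1, 1], rounded to 4 decimals (approximate values).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def nf4_round_trip(weights, block_size=64):
    """Quantize then dequantize, block by block (block_size is an assumption)."""
    out = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
        for w in block:
            level = min(NF4_LEVELS, key=lambda q: abs(q - w / scale))
            out.append(level * scale)
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
n = 4096  # one row of a weight matrix, hypothetical size
weights = [random.gauss(0.0, 0.02) for _ in range(n)]  # synthetic weights
inputs = [random.gauss(0.0, 1.0) for _ in range(n)]    # synthetic activations

approx = nf4_round_trip(weights)

exact_out = dot(weights, inputs)
approx_out = dot(approx, inputs)

# Error actually observed in the output of the dot product.
actual_error = abs(approx_out - exact_out)
# Error if every per-weight rounding error had lined up in the same direction.
worst_case_error = sum(abs((a - w) * x) for a, w, x in zip(approx, weights, inputs))

print(f"exact output:     {exact_out:+.4f}")
print(f"4-bit output:     {approx_out:+.4f}")
print(f"actual error:     {actual_error:.4f}")
print(f"worst-case error: {worst_case_error:.4f}")
```

With independent, zero-centered inputs like these, the actual error should come out far below the worst case, because positive and negative contributions offset each other. Real activations are not perfectly independent of the weights, so treat this as intuition rather than a guarantee.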