Quantization Engineering: The 4-bit Frontier
How do you fit a 140GB model into a 24GB GPU? Through the magic—and math—of quantization.
What is Quantization?
Quantization is the process of reducing the numerical precision of a model's weights. Instead of storing each weight as a 16-bit floating-point number (FP16), we can often use 8-bit, 4-bit, or even 2-bit representations with surprisingly little loss in quality (typically measured as a small increase in perplexity).
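To make that concrete, here's a toy sketch of symmetric 4-bit quantization in NumPy. This illustrates only the core rounding step, not any production format; real schemes like GGUF, AWQ, and EXL2 add per-group scales, zero-points, and calibration on top of it.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray):
    """Naive symmetric 4-bit quantization of a weight tensor (toy example)."""
    # Map the weight range onto 16 signed integer levels: [-8, 7].
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    # Note: int8 storage wastes half the bits here; real formats pack
    # two 4-bit values per byte.
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate weights; the rounding error is the quantization loss.
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
print("mean abs error:", np.abs(w - w_hat).mean())
```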
The Major Formats
GGUF (successor to GGML)
The king of CPU + GPU inference. Created by the llama.cpp team, it lets you split a model between system RAM and VRAM by offloading as many layers to the GPU as will fit.
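As a sketch, here's what that RAM/VRAM split looks like through the llama-cpp-python bindings. The model path is a placeholder, and 35 offloaded layers is an arbitrary example; tune it to your VRAM budget.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-13b.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=35,  # offload 35 layers to VRAM; the rest stay in system RAM
    n_ctx=4096,       # context window size
)

out = llm("Q: What is quantization? A:", max_tokens=64)
print(out["choices"][0]["text"])
```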
AWQ (Activation-aware Weight Quantization)
Uses activation statistics to identify the "salient" weights that contribute most to accuracy, then rescales them so they survive quantization. Ideal for production inference via vLLM, including multi-GPU setups.
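For example, loading an AWQ checkpoint in vLLM takes a single flag; the model ID below is illustrative.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",  # an AWQ-quantized checkpoint on the Hub
    quantization="awq",                # select vLLM's AWQ kernels
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain AWQ in one sentence."], params)
print(outputs[0].outputs[0].text)
```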
EXL2 (ExLlamaV2)
Optimized specifically for NVIDIA GPUs. Supports fractional bits-per-weight (e.g., 4.65 bpw) and delivers some of the fastest tokens-per-second figures available in the 4-bit to 6-bit range.
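A minimal loading sketch with the exllamav2 library follows. The class names mirror the project's example scripts, but the API has shifted between versions, so treat this as an approximation; the model directory and sampling values are placeholders.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "./models/llama2-13b-exl2-4.65bpw"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocate KV cache as layers load
model.load_autosplit(cache)               # spread layers across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
print(generator.generate_simple("Quantization is", settings, 64))
```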
Calculating VRAM Requirements
A quick rule of thumb for weight memory:

VRAM (GB) ≈ parameters (billions) × (bits ÷ 8) × 1.2 overhead

Example: 70B model at 4-bit ≈ 70 × 0.5 × 1.2 ≈ 42GB VRAM, plus the KV cache (which grows with context length and batch size).
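The same rule of thumb as a small helper; the 1.2× overhead factor is a common heuristic rather than an exact figure.

```python
def estimate_vram_gb(params_billions: float, bits: float, overhead: float = 1.2) -> float:
    """Rule-of-thumb weight memory: parameters x bytes-per-weight x overhead.

    The overhead factor covers buffers and fragmentation; the KV cache is
    extra and scales with context length and batch size.
    """
    return params_billions * (bits / 8) * overhead

print(f"{estimate_vram_gb(70, 4):.1f} GB")   # 70B at 4-bit  -> ~42.0 GB
print(f"{estimate_vram_gb(70, 16):.1f} GB")  # 70B at FP16   -> ~168.0 GB
print(f"{estimate_vram_gb(7, 4):.1f} GB")    # 7B at 4-bit   -> ~4.2 GB
```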
When to use what?
- Home Server / Mac: Stick with **GGUF**. It's the most flexible and widely supported.
- Cloud Production (NVIDIA): Use **AWQ**. It integrates seamlessly with vLLM and TensorRT-LLM.
- Extreme Speed (Single GPU): Go with **EXL2**.