A full-precision (16-bit) Llama 3.1 70B model needs about 140GB of GPU memory for its weights alone. That’s $40,000 of hardware minimum. Quantized to Q4, it needs about 40GB and fits on a single A100 80GB you can rent for $860/month. Same model. Different tradeoff.
Model quantization reduces the numerical precision of a model’s weights, the billions of numbers that define how the model reasons. Open-weight models typically ship at 16 bits per weight (FP16 or BF16), which is what this article calls full precision; FP32 would double the memory figures quoted here. Quantization compresses those weights to 4, 5, 6, or 8 bits: smaller files, less memory, and faster inference, with some degradation in output quality.
For most SMB deployments, the quality loss is acceptable. For legal, medical, or high-stakes reasoning, that’s not a safe assumption — test before deploying.
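To make the precision tradeoff concrete, here is a minimal Python sketch of symmetric round-to-nearest quantization. It is an illustration only: GGUF, GPTQ, and AWQ use more elaborate block-wise or calibration-based schemes, and the function names and sample weights below are made up for this example.

```python
def quantize_symmetric(weights, bits):
    """Map each weight to an integer code in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return [0] * len(weights), 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate weights from the integer codes."""
    return [c * scale for c in codes]


def max_abs_error(weights, bits):
    """Largest gap between an original weight and its quantized version."""
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical sample weights, chosen only for illustration.
weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88]

for bits in (8, 4, 3):
    print(f"{bits}-bit max error: {max_abs_error(weights, bits):.4f}")
```

Fewer bits means fewer representable levels, so the rounding error on each weight grows. That accumulated error is the source of the quality loss discussed in the rest of this article.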
The Quantization Formats
GGUF (formerly GGML): The standard format for CPU and consumer GPU inference. Used by Ollama. Models distributed on Hugging Face in GGUF format run on laptops, Mac Mini M-series hardware, and NVIDIA consumer cards with no special setup.
GPTQ: Optimized for CUDA GPUs. Produces good results at 4-bit with minimal quality loss compared to FP16. Supported by vLLM for production inference servers.
AWQ (Activation-aware Weight Quantization): Generally produces better quality than GPTQ at the same bit width, at the cost of a slower initial quantization step. Also supported by vLLM.
The Q-Level Tradeoff
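Within GGUF models, the Q number indicates the approximate bits per weight. Suffixes such as _K_M and _K_S mark variants of the quantization scheme, and the effective size is typically slightly higher than the nominal bit count: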
| Level | Bits/Weight | VRAM for 8B Model | Quality vs Full |
|---|---|---|---|
| Q3_K_S | 3 bits | ~3.5 GB | Noticeably worse |
| Q4_K_M | 4 bits | ~5 GB | Acceptable for most tasks |
| Q5_K_M | 5 bits | ~6 GB | Close to full quality |
| Q6_K | 6 bits | ~6.8 GB | Near-identical to full |
| Q8_0 | 8 bits | ~9.5 GB | Essentially full quality |
| F16 | 16 bits | ~18 GB | Full precision |
For an 8B model (like Llama 3.1 8B):
- Q4 fits on a 6GB consumer GPU (such as the laptop RTX 3060; the desktop card has 12GB) or runs on CPU with 8GB of RAM
- Q8 needs about 10GB, so an RTX 3080 or above
- F16 needs about 18GB, which an RTX 4090 (24GB) handles with limited headroom for long contexts
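The table lookup above can be sketched as a small helper that picks the highest-quality level fitting a given card. The VRAM figures are copied from the 8B table; the function name and the 0.5GB default headroom are assumptions made for this sketch, not measured values.

```python
# Approximate VRAM (GB) for an 8B model, from the table above,
# ordered from lowest to highest quality.
VRAM_8B_GB = [
    ("Q3_K_S", 3.5),
    ("Q4_K_M", 5.0),
    ("Q5_K_M", 6.0),
    ("Q6_K", 6.8),
    ("Q8_0", 9.5),
    ("F16", 18.0),
]


def best_level_for(vram_gb, headroom_gb=0.5):
    """Return the highest-quality level that fits, or None if nothing fits.

    headroom_gb is a hypothetical safety margin reserved for the
    display, driver, and other processes sharing the GPU.
    """
    usable = vram_gb - headroom_gb
    fitting = [name for name, need in VRAM_8B_GB if need <= usable]
    return fitting[-1] if fitting else None


for card_gb in (6, 10, 24):
    print(f"{card_gb}GB card -> {best_level_for(card_gb)}")
```

The results line up with the bullets above: a 6GB card lands on Q4_K_M, a 10GB card on Q8_0, and a 24GB card on F16.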
For a 70B model:
- Q4 needs ~40GB, which one A100 80GB handles comfortably
- Q8 needs ~78GB, right at the limit of a single A100 80GB
- F16 needs ~140GB for the weights alone, which requires multiple GPUs
How Quantization Affects Quality
The quality loss from quantization is domain-dependent. Tasks where it matters most:
- Complex multi-step mathematical reasoning
- Code generation with subtle logic requirements
- Legal or medical analysis requiring precise factual recall
- Tasks that need the model to track many variables simultaneously
Tasks where quantization impact is minimal:
- Conversational support and FAQ responses
- Document summarization
- Classification and routing
- Simple extraction (dates, names, amounts from structured text)
For a customer support chatbot, Q4 is almost certainly fine. For a legal document analysis agent, run Q5 or Q8 and evaluate before deploying.
Practical VRAM Planning
A useful rule of thumb: the weights of a quantized model take approximately (parameters × bits per weight) / 8 bytes. Divide by 10^9 to get GB, then add ~20% for KV cache and runtime overhead.
Examples:
- Llama 3.1 8B Q4: (8 × 10^9 × 4) / 8 / 10^9 = 4GB + 20% = 4.8GB, or ~5GB
- Mistral 7B Q5: (7 × 10^9 × 5) / 8 / 10^9 ≈ 4.4GB + 20% ≈ 5.3GB
- Llama 3.3 70B Q4: (70 × 10^9 × 4) / 8 / 10^9 = 35GB + 20% = 42GB
Over-provisioning by 25% on top of that estimate is standard practice: unexpected long-context requests spike the KV cache and cause OOM crashes at exactly the wrong moment.
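The rule of thumb is easy to turn into a planning helper. This is a sketch under stated assumptions: it uses decimal gigabytes (10^9 bytes), which is what the worked figures in the examples correspond to, and the function and parameter names are illustrative rather than part of any tool.

```python
def estimate_vram_gb(params_billion, bits_per_weight, overhead=0.20):
    """Weights in GB plus a fractional allowance for KV cache and runtime."""
    # (params * bits) / 8 gives bytes; the 1e9 factors for "billion"
    # and for bytes-per-GB cancel out.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * (1 + overhead)


def provisioned_vram_gb(params_billion, bits_per_weight, margin=0.25):
    """Estimate plus the 25% over-provisioning margin suggested above."""
    return estimate_vram_gb(params_billion, bits_per_weight) * (1 + margin)


for name, params, bits in [
    ("Llama 3.1 8B Q4", 8, 4),
    ("Mistral 7B Q5", 7, 5),
    ("Llama 3.3 70B Q4", 70, 4),
]:
    need = estimate_vram_gb(params, bits)
    plan = provisioned_vram_gb(params, bits)
    print(f"{name}: estimate {need:.1f}GB, provision {plan:.1f}GB")
```

Treat the output as a first-pass budget. Real usage depends on context length, batch size, and the runtime, so measure on your own workload before buying hardware.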
The Refresh Cycle
Self-hosted model management means you own the quantization lifecycle. Every 6–8 weeks, newer models ship with better quality at the same parameter count. Staying current takes three steps: download the new base weights (or a pre-quantized community release), run a quality regression against your golden eval set, and swap the model in Ollama or vLLM.
This isn’t zero-cost. Budget 3–4 hours every 6–8 weeks for model maintenance. Ignore it and your self-hosted deployment drifts while the managed API providers automatically improve.
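The regression step can be as simple as comparing per-task scores before and after the swap. The sketch below assumes you already have scores between 0.0 and 1.0 for each task in your golden eval set. The task names, the scores, and the 0.02 tolerance are hypothetical placeholders.

```python
def regression_report(baseline_scores, candidate_scores, max_drop=0.02):
    """Return tasks where the candidate is missing or drops by more than max_drop."""
    regressions = {}
    for task, base in baseline_scores.items():
        cand = candidate_scores.get(task)
        if cand is None or base - cand > max_drop:
            regressions[task] = (base, cand)
    return regressions


def safe_to_swap(baseline_scores, candidate_scores, max_drop=0.02):
    """True only when no task regressed beyond the tolerance."""
    return not regression_report(baseline_scores, candidate_scores, max_drop)


# Hypothetical scores for the current model and a new candidate.
baseline = {"summarize": 0.91, "extract": 0.88, "route": 0.95}
candidate = {"summarize": 0.93, "extract": 0.80, "route": 0.95}

print(regression_report(baseline, candidate))
print("swap" if safe_to_swap(baseline, candidate) else "hold")
```

Checking per task rather than on an average matters here, because a new model or quantization level can improve overall while regressing on the one task your deployment depends on.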
Choosing the Right Quantization
For development and testing: Q4 is fine. It’s fast, cheap on memory, and the quality is sufficient for evaluating whether the model can handle your task.
For production with general tasks: Q5 or Q6. The quality difference from Q4 is noticeable in edge cases, and the memory premium is small.
For production with high-stakes tasks: Q8 or full precision. Don’t optimize for cost here. Budget for the hardware.
Default: start with Q4_K_M for evaluation, switch to Q5_K_M or Q6_K for production if your hardware supports it.
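The guidance above can be written down as a small decision helper. This is one possible encoding, not a fixed rule: the profile names are invented for this sketch, and the preference order within each profile (highest quality first) is an assumption.

```python
# Preferred levels per deployment profile, best choice first.
PREFERENCES = {
    "development": ["Q4_K_M"],
    # Production default: Q6_K or Q5_K_M, falling back to Q4_K_M
    # when the hardware cannot hold anything larger.
    "production_general": ["Q6_K", "Q5_K_M", "Q4_K_M"],
    # High stakes: no fallback below Q8; budget for the hardware instead.
    "production_high_stakes": ["F16", "Q8_0"],
}


def choose_level(profile, levels_that_fit):
    """Pick the first preferred level the hardware can hold, else None."""
    for level in PREFERENCES[profile]:
        if level in levels_that_fit:
            return level
    return None


# Example: a card that can hold up to Q5_K_M for the chosen model.
fits = {"Q3_K_S", "Q4_K_M", "Q5_K_M"}
print(choose_level("development", fits))
print(choose_level("production_general", fits))
print(choose_level("production_high_stakes", fits))
```

A `None` result for the high-stakes profile is intentional. It signals that the hardware needs to change, rather than silently dropping to a lower precision.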
Related
- Ollama — Runs GGUF models; downloads pre-quantized community releases automatically
- vLLM — Production inference with GPTQ/AWQ support for GPU servers
- Self-Hosted AI — The broader framework for when self-hosting makes sense
- TCO — The hardware cost calculation quantization affects
- Infrastructure Layer — Where quantization decisions live in deployment planning