A full-precision Llama 3.1 70B model requires 140GB of GPU memory. That’s $40,000 of hardware minimum. Quantized to Q4, it fits in 40GB — a single A100 you can rent for $860/month. Same model. Different tradeoff.

Model quantization reduces the numerical precision of a model’s weights — the billions of numbers that define how the model reasons. Full precision (FP32) stores each weight as a 32-bit float. Quantization compresses those to 4, 5, 6, or 8 bits — smaller files, less memory, faster inference, with some degradation in output quality.

For most SMB deployments, the quality loss is acceptable. For legal, medical, or high-stakes reasoning, that’s not a safe assumption — test before deploying.

The Quantization Formats

GGUF (formerly GGML): The standard format for CPU and consumer GPU inference. Used by Ollama. Models distributed on Hugging Face in GGUF format run on laptops, Mac Mini M-series hardware, and NVIDIA consumer cards with no special setup.

GPTQ: Optimized for CUDA GPUs. Produces good results at 4-bit with minimal quality loss compared to FP16. Supported by vLLM for production inference servers.

AWQ (Activation-aware Weight Quantization): Generally produces better quality than GPTQ at the same bit width. Slower to quantize initially but better inference quality. Also supported by vLLM.

The Q-Level Tradeoff

Within GGUF models, the Q number indicates bit precision:

LevelBits/WeightVRAM for 8B ModelQuality vs Full
Q3_K_S3 bits~3.5 GBNoticeably worse
Q4_K_M4 bits~5 GBAcceptable for most tasks
Q5_K_M5 bits~6 GBClose to full quality
Q6_K6 bits~6.8 GBNear-identical to full
Q8_08 bits~9.5 GBEssentially full quality
F1616 bits~18 GBFull precision

For an 8B model (like Llama 3.1 8B):

  • Q4 fits in a consumer 6GB GPU (RTX 3060) or runs on CPU with 8GB RAM
  • Q8 needs 10GB — requires an RTX 3080 or above
  • F16 needs 18GB — an RTX 4090 (24GB) handles it, barely

For a 70B model:

  • Q4 needs ~40GB — one A100 80GB handles it comfortably
  • Q8 needs ~78GB — right at the limit of a single A100
  • F16 needs ~140GB — requires multiple GPUs

How Quantization Affects Quality

The quality loss from quantization is domain-dependent. Tasks where it matters most:

  • Complex multi-step mathematical reasoning
  • Code generation with subtle logic requirements
  • Legal or medical analysis requiring precise factual recall
  • Tasks that need the model to track many variables simultaneously

Tasks where quantization impact is minimal:

  • Conversational support and FAQ responses
  • Document summarization
  • Classification and routing
  • Simple extraction (dates, names, amounts from structured text)

For a customer support chatbot, Q4 is almost certainly fine. For a legal document analysis agent, run Q5 or Q8 and evaluate before deploying.

Practical VRAM Planning

A useful rule of thumb: the VRAM required for a quantized model is approximately (parameters × bits per weight) / 8 / 1024^3 GB, plus ~20% for KV cache overhead.

Examples:

  • Llama 3.1 8B Q4: (8B × 4) / 8 / 1024^3 ≈ 4GB + 20% = ~5GB
  • Mistral 7B Q5: (7B × 5) / 8 / 1024^3 ≈ 4.4GB + 20% = ~5.2GB
  • Llama 3.3 70B Q4: (70B × 4) / 8 / 1024^3 ≈ 35GB + 20% = ~42GB

Over-provisioning by 25% is standard practice — unexpected long-context requests spike the KV cache and cause OOM crashes at exactly the wrong moment.

The Refresh Cycle

Self-hosted model management means you own the quantization lifecycle. Every 6–8 weeks, newer models ship with better quality at the same parameter count. Staying current requires: download the new base weights (or a pre-quantized community release), run a quality regression against your golden eval set, and swap the model in Ollama or vLLM.

This isn’t zero-cost. Budget 3–4 hours every 6–8 weeks for model maintenance. Ignore it and your self-hosted deployment drifts while the managed API providers automatically improve.

Choosing the Right Quantization

For development and testing: Q4 is fine. It’s fast, cheap on memory, and the quality is sufficient for evaluating whether the model can handle your task.

For production with general tasks: Q5 or Q6. The quality difference from Q4 is noticeable in edge cases, and the memory premium is small.

For production with high-stakes tasks: Q8 or full precision. Don’t optimize for cost here. Budget for the hardware.

Default: start with Q4_K_M for evaluation, switch to Q5_K_M or Q6_K for production if your hardware supports it.

  • Ollama — Runs GGUF models; downloads pre-quantized community releases automatically
  • vLLM — Production inference with GPTQ/AWQ support for GPU servers
  • Self-Hosted AI — The broader framework for when self-hosting makes sense
  • TCO — The hardware cost calculation quantization affects
  • Infrastructure Layer — Where quantization decisions live in deployment planning