Model Quantization

A full-precision Llama 3.1 70B model requires 140GB of GPU memory. That’s $40,000 of hardware minimum. Quantized to Q4, it fits in 40GB — a single A100 you can rent for $860/month. Same model. Different tradeoff.

Model quantization reduces the numerical precision of a model’s weights — the billions of numbers that define how the model reasons. Full precision (FP32) stores each weight as a 32-bit float. Quantization compresses those to 4, 5, 6, or 8 bits — smaller files, less memory, faster inference, with some degradation in output quality.

For most SMB deployments, the quality loss is acceptable. For legal, medical, or high-stakes reasoning, that’s not a safe assumption — test before deploying.

The Quantization Formats

GGUF (formerly GGML): The standard format for CPU and consumer GPU inference. Used by Ollama. Models distributed on Hugging Face in GGUF format run on laptops, Mac Mini M-series hardware, and NVIDIA consumer cards with no special setup.

GPTQ: Optimized for CUDA GPUs. Produces good results at 4-bit with minimal quality loss compared to FP16. Supported by vLLM for production inference servers.

AWQ (Activation-aware Weight Quantization): Generally produces better quality than GPTQ at the same bit width. Slower to quantize initially but better inference quality. Also supported by vLLM.

The Q-Level Tradeoff

Within GGUF models, the Q number indicates bit precision:

Level	Bits/Weight	VRAM for 8B Model	Quality vs Full
Q3_K_S	3 bits	~3.5 GB	Noticeably worse
Q4_K_M	4 bits	~5 GB	Acceptable for most tasks
Q5_K_M	5 bits	~6 GB	Close to full quality
Q6_K	6 bits	~6.8 GB	Near-identical to full
Q8_0	8 bits	~9.5 GB	Essentially full quality
F16	16 bits	~18 GB	Full precision

For an 8B model (like Llama 3.1 8B):

Q4 fits in a consumer 6GB GPU (RTX 3060) or runs on CPU with 8GB RAM
Q8 needs 10GB — requires an RTX 3080 or above
F16 needs 18GB — an RTX 4090 (24GB) handles it, barely

For a 70B model:

Q4 needs ~40GB — one A100 80GB handles it comfortably
Q8 needs ~78GB — right at the limit of a single A100
F16 needs ~140GB — requires multiple GPUs

How Quantization Affects Quality

The quality loss from quantization is domain-dependent. Tasks where it matters most:

Complex multi-step mathematical reasoning
Code generation with subtle logic requirements
Legal or medical analysis requiring precise factual recall
Tasks that need the model to track many variables simultaneously

Tasks where quantization impact is minimal:

Conversational support and FAQ responses
Document summarization
Classification and routing
Simple extraction (dates, names, amounts from structured text)

For a customer support chatbot, Q4 is almost certainly fine. For a legal document analysis agent, run Q5 or Q8 and evaluate before deploying.

Practical VRAM Planning

A useful rule of thumb: the VRAM required for a quantized model is approximately (parameters × bits per weight) / 8 / 1024^3 GB, plus ~20% for KV cache overhead.

Examples:

Llama 3.1 8B Q4: (8B × 4) / 8 / 1024^3 ≈ 4GB + 20% = ~5GB
Mistral 7B Q5: (7B × 5) / 8 / 1024^3 ≈ 4.4GB + 20% = ~5.2GB
Llama 3.3 70B Q4: (70B × 4) / 8 / 1024^3 ≈ 35GB + 20% = ~42GB

Over-provisioning by 25% is standard practice — unexpected long-context requests spike the KV cache and cause OOM crashes at exactly the wrong moment.

The Refresh Cycle

Self-hosted model management means you own the quantization lifecycle. Every 6–8 weeks, newer models ship with better quality at the same parameter count. Staying current requires: download the new base weights (or a pre-quantized community release), run a quality regression against your golden eval set, and swap the model in Ollama or vLLM.

This isn’t zero-cost. Budget 3–4 hours every 6–8 weeks for model maintenance. Ignore it and your self-hosted deployment drifts while the managed API providers automatically improve.

Choosing the Right Quantization

For development and testing: Q4 is fine. It’s fast, cheap on memory, and the quality is sufficient for evaluating whether the model can handle your task.

For production with general tasks: Q5 or Q6. The quality difference from Q4 is noticeable in edge cases, and the memory premium is small.

For production with high-stakes tasks: Q8 or full precision. Don’t optimize for cost here. Budget for the hardware.

Default: start with Q4_K_M for evaluation, switch to Q5_K_M or Q6_K for production if your hardware supports it.

Ollama — Runs GGUF models; downloads pre-quantized community releases automatically
vLLM — Production inference with GPTQ/AWQ support for GPU servers
Self-Hosted AI — The broader framework for when self-hosting makes sense
TCO — The hardware cost calculation quantization affects
Infrastructure Layer — Where quantization decisions live in deployment planning

WyrdWerk Deployment Wiki

Explorer

Model Quantization

The Quantization Formats

The Q-Level Tradeoff

How Quantization Affects Quality

Practical VRAM Planning

The Refresh Cycle

Choosing the Right Quantization

Graph View

Table of Contents

Backlinks

WyrdWerk Deployment Wiki

Explorer

Model Quantization

The Quantization Formats

The Q-Level Tradeoff

How Quantization Affects Quality

Practical VRAM Planning

The Refresh Cycle

Choosing the Right Quantization

Related

Graph View

Table of Contents

Backlinks