vLLM is what you reach for when Ollama stops being enough. It’s also what you reach for too early — and that’s the expensive mistake.
vLLM is a high-performance inference engine built for production LLM serving. PagedAttention and continuous batching handle hundreds of concurrent users at low latency — the kind of load Ollama can’t touch.
What It Does
- PagedAttention for efficient GPU memory use
- Continuous batching — new requests join without waiting for the current batch to finish
- Multi-GPU tensor parallelism and distributed serving
- OpenAI-compatible API for drop-in migration from Ollama or cloud providers
Typical Costs
| Item | Monthly (24/7) |
|---|---|
| A100 40GB cloud | ~$860 |
| A100 80GB cloud | ~$1,145 |
| H100 80GB cloud | ~$1,080–$2,150 |
| Hidden: DevOps, model updates, networking | 3–5× raw GPU rental |
Self-hosting only beats managed APIs at very high volume — roughly 11 billion tokens/month, or when compliance mandates on-premise inference regardless of cost.
Where It Breaks
Setup Exhaustion
vLLM for a solo prototype is overkill. CUDA configuration, model shard downloads, tensor parallel settings — days of work for an audience of one. Use Ollama for development; switch to vLLM at production deploy.
100% GPU Utilization Crashes
Running at full GPU memory leaves no headroom for KV cache growth. Random OOMs hit at peak hours, not during testing. Set --gpu-memory-utilization 0.90 and treat the remaining 10% as mandatory buffer.
No Cloud Fallback
When your self-hosted endpoint dies at 2 AM — and it will — your product is down. Route through a gateway with automatic fallback to a cloud API during outages. The overflow API bill is cheaper than downtime.
When to Choose It
- 50+ concurrent users on a customer-facing AI feature
- API bills consistently exceed $10K/month and growing
- Strict data privacy mandate that blocks external APIs
- You have (or can hire) someone to maintain CUDA, model updates, and GPU infrastructure
Default for solo implementers: Managed APIs until the math is undeniable.
Related
- Ollama — Development and small-team serving
- Self-Hosted AI — Build vs buy framework
- TCO — Real cost numbers
- Infrastructure Layer — Hosting architecture
- Stack & Tools — Platform profiles
- LiteLLM — The gateway layer for cloud fallback when vLLM goes down
- Langfuse — Observability and cost tracking for vLLM production deployments
- Model Quantization — GPTQ and AWQ formats vLLM supports for GPU memory optimization