vLLM

vLLM is what you reach for when Ollama stops being enough. It’s also what you reach for too early — and that’s the expensive mistake.

vLLM is a high-performance inference engine built for production LLM serving. PagedAttention and continuous batching handle hundreds of concurrent users at low latency — the kind of load Ollama can’t touch.

What It Does

PagedAttention for efficient GPU memory use
Continuous batching — new requests join without waiting for the current batch to finish
Multi-GPU tensor parallelism and distributed serving
OpenAI-compatible API for drop-in migration from Ollama or cloud providers

Typical Costs

Item	Monthly (24/7)
A100 40GB cloud	~$860
A100 80GB cloud	~$1,145
H100 80GB cloud	~$1,080–$2,150
Hidden: DevOps, model updates, networking	3–5× raw GPU rental

Self-hosting only beats managed APIs at very high volume — roughly 11 billion tokens/month, or when compliance mandates on-premise inference regardless of cost.

Where It Breaks

Setup Exhaustion

vLLM for a solo prototype is overkill. CUDA configuration, model shard downloads, tensor parallel settings — days of work for an audience of one. Use Ollama for development; switch to vLLM at production deploy.

100% GPU Utilization Crashes

Running at full GPU memory leaves no headroom for KV cache growth. Random OOMs hit at peak hours, not during testing. Set --gpu-memory-utilization 0.90 and treat the remaining 10% as mandatory buffer.

No Cloud Fallback

When your self-hosted endpoint dies at 2 AM — and it will — your product is down. Route through a gateway with automatic fallback to a cloud API during outages. The overflow API bill is cheaper than downtime.

When to Choose It

50+ concurrent users on a customer-facing AI feature
API bills consistently exceed $10K/month and growing
Strict data privacy mandate that blocks external APIs
You have (or can hire) someone to maintain CUDA, model updates, and GPU infrastructure

Default for solo implementers: Managed APIs until the math is undeniable.

Ollama — Development and small-team serving
Self-Hosted AI — Build vs buy framework
TCO — Real cost numbers
Infrastructure Layer — Hosting architecture
Stack & Tools — Platform profiles
LiteLLM — The gateway layer for cloud fallback when vLLM goes down
Langfuse — Observability and cost tracking for vLLM production deployments
Model Quantization — GPTQ and AWQ formats vLLM supports for GPU memory optimization

WyrdWerk Deployment Wiki

Explorer

vLLM

What It Does

Typical Costs

Where It Breaks

Setup Exhaustion

100% GPU Utilization Crashes

No Cloud Fallback

When to Choose It

Graph View

Table of Contents

Backlinks

WyrdWerk Deployment Wiki

Explorer

vLLM

What It Does

Typical Costs

Where It Breaks

Setup Exhaustion

100% GPU Utilization Crashes

No Cloud Fallback

When to Choose It

Related

Graph View

Table of Contents

Backlinks