Your self-hosted vLLM endpoint goes down at 2 AM. Without a gateway, your product is down. With LiteLLM, your traffic automatically reroutes to a cloud API. The overflow bill is $12. The alternative is a downtime incident.
LiteLLM is an open-source library and proxy server that gives you a single, unified interface to 100+ LLM providers — OpenAI, Anthropic, Google Vertex AI, AWS Bedrock, Ollama, vLLM, and more. Every response comes back in the OpenAI Chat Completions format regardless of which provider handled it. Your application code stops caring which model is actually running.
What It Does
- Unified API: One
completion()call works across every supported provider. No rewriting app code when you switch models. - Automatic fallback and load balancing: Routes to a backup endpoint on failure. Distributes traffic across deployments. Configurable retry logic with exponential backoff.
- Virtual keys and budgets: Per-team, per-user, per-project API keys with hard spend limits. A retry loop can’t silently generate a $3,000 bill when there’s a $50 cap on the key.
- Cost tracking: Token usage and cost per request, per key, per team, across all providers in one place.
- Observability integration: Native integrations with Langfuse, MLflow, Helicone via a single callback line.
Deployment Options
Python SDK — Drop-in replacement for the OpenAI Python client. If your code calls openai.chat.completions.create(), you change three lines and gain access to every provider LiteLLM supports.
Proxy Server (LLM Gateway) — Self-hosted Docker container running an OpenAI-compatible API endpoint. Any client that works with OpenAI works with the proxy, zero changes.
# Start proxy
litellm --model ollama/llama3.1 --fallback gpt-4o-mini
# Or via Docker with config
docker run -p 4000:4000 ghcr.io/berriai/litellm:main \
-c /app/config.yamlThe proxy is where most SMB deployments end up: point all your app code at localhost:4000, configure providers in config.yaml, and never touch your app again when you swap models.
The Fallback Pattern That Pays For Itself
vLLM’s no-cloud-fallback problem is the clearest LiteLLM use case. A self-hosted vLLM instance running on a single GPU has no redundancy. When it goes down, everything downstream fails.
With LiteLLM as the gateway:
# config.yaml
model_list:
- model_name: gpt-4o-mini-router
litellm_params:
model: ollama/llama3.1 # Primary: your hardware
fallback: gpt-4o-mini # Secondary: cloud API
max_retries: 3Traffic hits your local Ollama or vLLM first. On failure, it routes to GPT-4o-mini. You pay $0.15/million tokens on overflow instead of having a production incident. For SMBs running <11B tokens/month where managed APIs are cheaper anyway, this hybrid pattern gives you data-residency benefits for most queries while staying live 100% of the time.
Cost Routing
LiteLLM’s router supports model routing by request complexity — though this requires custom logic in your implementation. The standard pattern is:
- Route simple classification or extraction tasks to cheaper models (GPT-4o-mini, Claude Haiku)
- Reserve expensive models (GPT-4o, Claude Sonnet) for complex reasoning
- Set per-key budgets so no single integration can overspend
Virtual key budgets are the most important feature for SMBs. A $50/month budget cap on the n8n integration key means an agent loop can’t silently generate a $3,000 bill before anyone notices.
Observability with Langfuse
One line adds full cost and quality observability:
import litellm
litellm.success_callback = ["langfuse"]Every LiteLLM call now logs to Langfuse: token cost, latency, model used, and full trace.
Where It Breaks
Config management overhead. LiteLLM’s config.yaml grows fast. 10 models, 3 fallback chains, 20 virtual keys — it becomes infrastructure code that needs version control and review. Treat it like application config, not a one-time setup.
Caching complexity. Semantic caching is available but adds another layer to debug. Stale cached responses are a real failure mode when you’re trying to test prompt changes.
Not a replacement for load balancing at scale. LiteLLM handles model-level routing. For 500+ concurrent users, you still need proper infrastructure load balancing in front of your vLLM instances. LiteLLM fits in the stack, not on top of it.
When to Choose It
- Running Ollama or vLLM and need cloud fallback for uptime
- Multiple teams or applications sharing LLM access and you need cost visibility per team
- Switching between providers and tired of rewriting SDK calls
- Compliance allows cloud fallback for <5% of traffic but requires on-prem for the rest
Default for solo implementers: Skip LiteLLM until you have two LLM endpoints to manage. Your first 2 AM outage will decide the rest.
Related
- Ollama — Local inference that LiteLLM routes as primary
- vLLM — Production inference behind LiteLLM for failover
- Langfuse — Observability layer LiteLLM integrates with natively
- TCO — The cost math that determines when routing saves money
- Self-Hosted AI — The build-vs-buy framework LiteLLM extends
- Cost Overrun — What happens without per-key budget caps
- Infrastructure Layer — Where LiteLLM sits in the architecture
- Stack & Tools — Platform profiles