LiteLLM

Your self-hosted vLLM endpoint goes down at 2 AM. Without a gateway, your product is down. With LiteLLM, your traffic automatically reroutes to a cloud API. The overflow bill is $12. The alternative is a downtime incident.

LiteLLM is an open-source library and proxy server that gives you a single, unified interface to 100+ LLM providers — OpenAI, Anthropic, Google Vertex AI, AWS Bedrock, Ollama, vLLM, and more. Every response comes back in the OpenAI Chat Completions format regardless of which provider handled it. Your application code stops caring which model is actually running.

What It Does

Unified API: One completion() call works across every supported provider. No rewriting app code when you switch models.
Automatic fallback and load balancing: Routes to a backup endpoint on failure. Distributes traffic across deployments. Configurable retry logic with exponential backoff.
Virtual keys and budgets: Per-team, per-user, per-project API keys with hard spend limits. A retry loop can’t silently generate a $3,000 bill when there’s a $50 cap on the key.
Cost tracking: Token usage and cost per request, per key, per team, across all providers in one place.
Observability integration: Native integrations with Langfuse, MLflow, Helicone via a single callback line.

Deployment Options

Python SDK — Drop-in replacement for the OpenAI Python client. If your code calls openai.chat.completions.create(), you change three lines and gain access to every provider LiteLLM supports.

Proxy Server (LLM Gateway) — Self-hosted Docker container running an OpenAI-compatible API endpoint. Any client that works with OpenAI works with the proxy, zero changes.

# Start proxy
litellm --model ollama/llama3.1 --fallback gpt-4o-mini
 
# Or via Docker with config
docker run -p 4000:4000 ghcr.io/berriai/litellm:main \
  -c /app/config.yaml

The proxy is where most SMB deployments end up: point all your app code at localhost:4000, configure providers in config.yaml, and never touch your app again when you swap models.

The Fallback Pattern That Pays For Itself

vLLM’s no-cloud-fallback problem is the clearest LiteLLM use case. A self-hosted vLLM instance running on a single GPU has no redundancy. When it goes down, everything downstream fails.

With LiteLLM as the gateway:

# config.yaml
model_list:
  - model_name: gpt-4o-mini-router
    litellm_params:
      model: ollama/llama3.1          # Primary: your hardware
      fallback: gpt-4o-mini           # Secondary: cloud API
      max_retries: 3

Traffic hits your local Ollama or vLLM first. On failure, it routes to GPT-4o-mini. You pay $0.15/million tokens on overflow instead of having a production incident. For SMBs running <11B tokens/month where managed APIs are cheaper anyway, this hybrid pattern gives you data-residency benefits for most queries while staying live 100% of the time.

Cost Routing

LiteLLM’s router supports model routing by request complexity — though this requires custom logic in your implementation. The standard pattern is:

Route simple classification or extraction tasks to cheaper models (GPT-4o-mini, Claude Haiku)
Reserve expensive models (GPT-4o, Claude Sonnet) for complex reasoning
Set per-key budgets so no single integration can overspend

Virtual key budgets are the most important feature for SMBs. A $50/month budget cap on the n8n integration key means an agent loop can’t silently generate a $3,000 bill before anyone notices.

Observability with Langfuse

One line adds full cost and quality observability:

import litellm
litellm.success_callback = ["langfuse"]

Every LiteLLM call now logs to Langfuse: token cost, latency, model used, and full trace.

Where It Breaks

Config management overhead. LiteLLM’s config.yaml grows fast. 10 models, 3 fallback chains, 20 virtual keys — it becomes infrastructure code that needs version control and review. Treat it like application config, not a one-time setup.

Caching complexity. Semantic caching is available but adds another layer to debug. Stale cached responses are a real failure mode when you’re trying to test prompt changes.

Not a replacement for load balancing at scale. LiteLLM handles model-level routing. For 500+ concurrent users, you still need proper infrastructure load balancing in front of your vLLM instances. LiteLLM fits in the stack, not on top of it.

When to Choose It

Running Ollama or vLLM and need cloud fallback for uptime
Multiple teams or applications sharing LLM access and you need cost visibility per team
Switching between providers and tired of rewriting SDK calls
Compliance allows cloud fallback for <5% of traffic but requires on-prem for the rest

Default for solo implementers: Skip LiteLLM until you have two LLM endpoints to manage. Your first 2 AM outage will decide the rest.

Ollama — Local inference that LiteLLM routes as primary
vLLM — Production inference behind LiteLLM for failover
Langfuse — Observability layer LiteLLM integrates with natively
TCO — The cost math that determines when routing saves money
Self-Hosted AI — The build-vs-buy framework LiteLLM extends
Cost Overrun — What happens without per-key budget caps
Infrastructure Layer — Where LiteLLM sits in the architecture
Stack & Tools — Platform profiles

WyrdWerk Deployment Wiki

Explorer

LiteLLM

What It Does

Deployment Options

The Fallback Pattern That Pays For Itself

Cost Routing

Observability with Langfuse

Where It Breaks

When to Choose It

Graph View

Table of Contents

Backlinks

WyrdWerk Deployment Wiki

Explorer

LiteLLM

What It Does

Deployment Options

The Fallback Pattern That Pays For Itself

Cost Routing

Observability with Langfuse

Where It Breaks

When to Choose It

Related

Graph View

Table of Contents

Backlinks