The GPU rental ad shows $1.49/hour. The invoice shows DevOps salary, model updates, and the 2 AM outage nobody planned for.
Infrastructure decisions define your cost structure and compliance posture for years. The wrong call is expensive to reverse. For SMB solo implementers, the default is almost always managed APIs — until the math or the regulator says otherwise.
Cloud APIs vs Self-Hosting
Managed APIs (OpenAI, Anthropic, Google) win on cost for roughly 87% of use cases. Self-hosting becomes viable only at very high volume (around 11 billion tokens/month) or when compliance mandates on-premises inference.
| Scenario | Cheaper option |
|---|---|
| Under $10K/month API spend | Managed APIs |
| 50M tokens/day on GPT-4o-mini | Managed APIs (~$2,250/mo vs ~$5,175/mo self-hosted) |
| 500M+ tokens/day, steady | Self-hosting (can be 5× cheaper) |
Hidden self-hosting costs: DevOps engineering, model refresh cycles every 6–8 weeks, networking, load balancing, and downtime that falls on you, not the provider.
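The 50M tokens/day row is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below is illustrative only: the $1.50 per million tokens blended rate is an assumption back-derived from the table's ~$2,250/mo figure (it is not a quoted provider price), the 30-day month is an assumption, and `monthly_api_cost` is a hypothetical helper name.

```python
def monthly_api_cost(tokens_per_day: float,
                     usd_per_million_tokens: float,
                     days: int = 30) -> float:
    """Monthly API spend in USD for a steady daily token volume."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens * days


# Assumption: blended rate implied by the table (~$2,250/mo at 50M tokens/day).
ASSUMED_BLENDED_RATE = 1.50
# Self-hosted monthly figure taken from the same table row.
SELF_HOSTED_MONTHLY = 5_175.0

api = monthly_api_cost(50_000_000, ASSUMED_BLENDED_RATE)
ratio = SELF_HOSTED_MONTHLY / api

print(f"API: ${api:,.0f}/mo vs self-hosted: ${SELF_HOSTED_MONTHLY:,.0f}/mo")
print(f"Self-hosting costs {ratio:.1f}x the API bill at this volume")
```

Swap in your own daily volume and the actual input/output pricing for your model before drawing conclusions. The self-hosted figure here also excludes the hidden costs listed above, so the real gap may be wider.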
The Decision Framework
| Concurrent Users | Start With |
|---|---|
| Under 5 | Ollama (local) or managed API |
| 5–50 | vLLM on single GPU or managed API |
| 50–500 | vLLM with tensor parallelism |
| 500+ | Multiple vLLM instances behind load balancer |
Most SMBs sit comfortably in the top two rows. If you’re in the bottom half, you already know it.
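If you want the table as a quick lookup in a planning script, it might look like the sketch below. `recommend_start` is a hypothetical helper, and the handling of boundary values is an assumption: the table's ranges overlap at 50 and 500, so this version assigns each of those to the higher tier.

```python
def recommend_start(concurrent_users: int) -> str:
    """Map a concurrent-user count to the starting point from the table."""
    if concurrent_users < 0:
        raise ValueError("concurrent_users must be non-negative")
    if concurrent_users < 5:
        return "Ollama (local) or managed API"
    if concurrent_users < 50:
        return "vLLM on single GPU or managed API"
    if concurrent_users < 500:  # assumption: exactly 50 falls in this tier
        return "vLLM with tensor parallelism"
    # assumption: exactly 500 falls in the top tier
    return "Multiple vLLM instances behind load balancer"


for users in (3, 20, 200, 1000):
    print(f"{users:>5} users -> {recommend_start(users)}")
```

Treat the output as a starting point, not a sizing guarantee. Model size, context length, and latency targets all shift where these thresholds actually fall.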
Key Content
- Self-Hosted AI — When to run your own
- Ollama — Local development and small teams
- vLLM — Production-scale inference
- Deployment Patterns — Architecture by team size
- TCO — Real cost numbers