Ollama

One command, local LLM. Perfect for day one. Dangerous for day ninety if you skipped the concurrency math.

Ollama packages model weights, quantization, and memory management into a single binary. Download a model, serve it with an OpenAI-compatible API at localhost:11434, and point your existing app code at it — swap the base URL, keep everything else.

What It Does

Run Llama, Mistral, Gemma, Phi, Qwen, DeepSeek, and thousands of GGUF models
OpenAI-compatible /v1/chat/completions endpoint
Automatic GPU layer allocation (CUDA, Metal, ROCm)
One-command install and model pull

Typical Costs

Item	Cost
Software	Free, open-source
Prototyping	$0 on existing laptop or desktop
Dedicated GPU (RTX 4090)	~$1,600 one-time + electricity
Cloud GPU (A100 24/7)	~$860–$1,145/month

Where It Breaks

Concurrency Collapse

Ollama caps at roughly 4 parallel requests by default. At 10 concurrent users, total throughput can drop to ~41 tokens/second. It’s built for development, not production. That ceiling comes up faster than you’d expect.

Memory Overflow

Long-context requests without explicit limits overflow the KV cache and trigger OOM kills. Configure concurrency and context limits before sharing the instance.

Model Drift (Self-Hosted)

Unlike managed APIs, your model version stays frozen until you manually update it. You’re responsible for re-quantization, testing, and redeployment every 6–8 weeks as new models ship. That’s a real time cost. Plan for it.

When to Choose It

Local development and prompt testing
Air-gapped or regulated environments (data never leaves your hardware)
Internal tools with fewer than 5 concurrent users
GDPR/DPDP scenarios where API routing crosses borders you can’t accept

Don’t use Ollama for: Customer-facing production at scale. That’s vLLM territory — or managed APIs.

Self-Hosted AI — When to run your own models
Data Residency — Where data lives and why it matters
vLLM — Production-scale inference
Infrastructure Layer — Hosting decisions
Silent Agent Failure — Drift you can’t see
Open WebUI — Browser-based chat interface that runs on top of Ollama
LiteLLM — API gateway for routing between Ollama and cloud fallbacks
Model Quantization — The GGUF format Ollama uses and how to choose Q-levels

WyrdWerk Deployment Wiki

Explorer

Ollama

What It Does

Typical Costs

Where It Breaks

Concurrency Collapse

Memory Overflow

Model Drift (Self-Hosted)

When to Choose It

Graph View

Table of Contents

Backlinks

WyrdWerk Deployment Wiki

Explorer

Ollama

What It Does

Typical Costs

Where It Breaks

Concurrency Collapse

Memory Overflow

Model Drift (Self-Hosted)

When to Choose It

Related

Graph View

Table of Contents

Backlinks