Only 62% of organizations running AI agents in production can inspect what their agents actually do at each step. The other 38% find out from users.
Langfuse is an open-source AI engineering platform for tracing, evaluating, and monitoring LLM applications in production. It closes the observability gap that makes AI operations fundamentally different from traditional software monitoring: standard infrastructure tools catch CPU spikes and 500 errors, but a model returning a confident wrong answer comes back as a successful 200 OK.
What It Does
- Traces: Logs every LLM call, tool call, retrieval step, and embedding — including cost, latency, full input/output, and multi-turn session context
- Cost tracking: Per-request, per-user, per-key cost breakdown across all providers in one view
- Prompt management: Version control, deploy, and compare prompt versions by latency, cost, and quality metrics — no code deploys required
- LLM-as-judge evaluations: Automated quality scoring on production traces; human annotation queue for review when scores drop below threshold
- Dashboards: Quality, cost, and latency in one view — the operations monitoring panel you’d otherwise build yourself
The LLM-as-Judge Integration
The drift monitoring playbook calls for a secondary model scoring 1% of production traffic for correctness, continuously. Langfuse implements this without custom code:
- Configure an evaluator (e.g., Claude Haiku) with a scoring rubric
- Langfuse samples production traces automatically
- Scores appear in the dashboard alongside cost and latency
- Alert threshold triggers human review when quality score drops
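The last two steps come down to threshold logic over judge scores. As a minimal sketch in plain Python (this is not Langfuse's implementation; `QualityMonitor`, the window size, and the 0.8 threshold are hypothetical choices for illustration), a rolling mean could trigger human review like this:

```python
from collections import deque


class QualityMonitor:
    """Rolling mean over judge scores. Hypothetical helper, not a Langfuse API."""

    def __init__(self, threshold: float = 0.8, window: int = 50):
        self.threshold = threshold
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def record(self, score: float) -> bool:
        """Add one judge score (0.0 to 1.0); return True when review is needed."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        mean = sum(self.scores) / len(self.scores)
        # Only alert once the window is full, so a single early score cannot fire it.
        return window_full and mean < self.threshold


monitor = QualityMonitor(threshold=0.8, window=3)
alerts = [monitor.record(s) for s in [0.9, 0.9, 0.9, 0.5, 0.5]]
print(alerts)
```

Averaging over a window rather than alerting on single scores trades detection speed for fewer false alarms from one noisy judge verdict.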
The law firm incident described in the LLM Drift entry (a model's legal reasoning drifted over 8 weeks and no one noticed) is exactly the failure this catches. With LLM-as-judge running on production traffic, the quality drop could have been flagged around week 2, not when a client complained at week 8.
Deployment Options
Cloud (langfuse.com): Hosted version with free tier. Suitable for development, small teams, and testing. Data stays in Langfuse’s infrastructure — not appropriate for deployments with strict data residency requirements.
Self-hosted (open source): Docker Compose deployment on your own infrastructure. Full data control. Suitable for GDPR/DPDP environments where LLM trace data (which includes real user inputs and outputs) can’t leave your network.
```shell
# Self-hosted via Docker Compose
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up
```

The self-hosted path is the right default for any deployment where compliance matters, which is most SMB deployments in regulated sectors.
Integration
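LiteLLM integration (one line):

```python
import litellm
litellm.success_callback = ["langfuse"]
```

Direct SDK (Python):

```python
import openai
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST from the environment
langfuse = Langfuse()

# Create a trace around your LLM call (v2-style SDK; newer SDK versions use a different API)
trace = langfuse.trace(name="customer-support")
response = openai.chat.completions.create(...)
trace.score(name="quality", value=0.9)  # attach a quality score to the trace
langfuse.flush()  # send buffered events before the process exits
```

Framework integrations: OpenAI SDK, LangChain, LlamaIndex, and 50+ others via OpenTelemetry-compatible instrumentation.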
The Solo Implementer Reality
For a solo implementer managing AI, Langfuse answers the question “is my deployment still working correctly?”, which infrastructure metrics alone cannot answer.
Traditional monitoring: the server is up, response times are normal, no 500 errors.
Langfuse: accuracy on customer intent classification dropped from 89% to 71% this week. Cost per session jumped 40% after the last prompt change. Three trace IDs produced outputs below your quality threshold.
The second view is what running an AI deployment actually requires.
Where It Breaks
Trace volume costs at scale. Logging every LLM call adds latency (minimal — async by default) and storage costs. At 1M+ traces/month, storage starts mattering. The practical fix: sample at 10-20% for production monitoring, 100% for development and testing.
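One way to apply the sampling fix is to decide per trace ID, so that every span belonging to a trace is kept or dropped together. The sketch below is plain Python and independent of Langfuse (`should_trace` and the 15% rate are hypothetical); the Langfuse SDKs have their own sampling configuration, so check the documentation for your version before rolling your own.

```python
import hashlib


def should_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of all trace IDs."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(should_trace(t, 0.15) for t in ids)
print(f"kept {kept} of {len(ids)} traces")
```

Hashing the ID instead of calling a random number generator means the same trace always gets the same decision, even when different services evaluate it independently.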
Evaluation latency. LLM-as-judge evaluations run asynchronously but still take seconds per trace. For real-time quality gates (reject bad responses before the user sees them), Langfuse is post-hoc — not a synchronous guardrail.
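For contrast, a synchronous guardrail sits in the request path and blocks the response until a check passes. The sketch below shows only the shape of that pattern; `guarded_reply`, `fake_model`, and `non_empty` are hypothetical stand-ins, and none of this is a Langfuse feature.

```python
from typing import Callable

FALLBACK = "Sorry, I can't answer that reliably. Routing you to a human."


def guarded_reply(
    generate: Callable[[str], str],
    check: Callable[[str, str], bool],
    prompt: str,
) -> str:
    """Run a blocking check before the response is returned to the user."""
    response = generate(prompt)
    return response if check(prompt, response) else FALLBACK


# Stand-ins for a real model call and a real validator (both hypothetical).
def fake_model(prompt: str) -> str:
    return "" if "refund" in prompt else "Your order ships Monday."


def non_empty(prompt: str, response: str) -> bool:
    return bool(response.strip())


print(guarded_reply(fake_model, non_empty, "Where is my order?"))
print(guarded_reply(fake_model, non_empty, "I want a refund"))
```

The check adds its full latency to every request, which is why cheap validators suit this path and slower LLM-as-judge scoring is better left to post-hoc evaluation.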
Data in traces is sensitive. Your LLM traces contain real user inputs and outputs. Self-hosting isn’t optional if those inputs include PII, health data, or financial details.
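Self-hosting controls where traces are stored; it can also help to scrub obvious identifiers before a trace is written at all. The following is a minimal sketch, assuming regex-based redaction is acceptable for your data. The patterns are illustrative and far from exhaustive, and `redact` is a hypothetical helper, not a Langfuse API.

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_NUMBER = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits, e.g. card numbers


def redact(text: str) -> str:
    """Replace email addresses and long digit sequences with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_NUMBER.sub("[NUMBER]", text)


sample = "Email jane@example.com, card 4111 1111 1111 1111 on file."
print(redact(sample))
```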
When to Choose It
- You have any AI feature in production and no visibility into output quality
- You run self-hosted models (Ollama/vLLM) and need to catch drift before users do
- Multiple team members modify prompts and you need to compare versions empirically
- Compliance requires on-premise trace storage
Default: Add Langfuse before going live, not after your first quality incident.
Related
- LLM Drift — The degradation Langfuse detects
- Silent Agent Failure — The failure mode Langfuse surfaces
- Cost Overrun — Cost tracking prevents billing surprises
- LiteLLM — Routes LLM traffic to Langfuse in one callback line
- Operations & Maintenance — Where monitoring lives in your stack
- Stack & Tools — Platform profiles