Langfuse

Only 62% of organizations running AI agents in production can inspect what their agents actually do at each step. The other 38% find out from users.

Langfuse is an open-source AI engineering platform for tracing, evaluating, and monitoring LLM applications in production. It closes the observability gap that makes AI operations fundamentally different from traditional software monitoring: standard infrastructure tools catch CPU spikes and 500 errors, but a model returning a confident wrong answer comes back as a successful 200 OK.

What It Does

Traces: Logs every LLM call, tool call, retrieval step, and embedding — including cost, latency, full input/output, and multi-turn session context
Cost tracking: Per-request, per-user, per-key cost breakdown across all providers in one view
Prompt management: Version, deploy, and compare prompts by latency, cost, and quality metrics, with no code deploys required
LLM-as-judge evaluations: Automated quality scoring on production traces, with a human annotation queue for review when scores drop below a threshold
Dashboards: Quality, cost, and latency in one view — the operations monitoring panel you’d otherwise build yourself

The LLM-as-Judge Integration

The drift monitoring playbook calls for a secondary model scoring 1% of production traffic for correctness, continuously. Langfuse implements this without custom code:

Configure an evaluator (e.g., Claude Haiku) with a scoring rubric
Langfuse samples production traces automatically
Scores appear in the dashboard alongside cost and latency
An alert threshold triggers human review when the quality score drops
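
The four steps above can be sketched in plain Python to show the control flow. This is an illustrative sketch under stated assumptions, not Langfuse's implementation: `JudgeLoop`, `judge`, `sample_rate`, and `threshold` are hypothetical names, and the judge stands in for any evaluator that returns a score between 0 and 1.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class JudgeLoop:
    """Sample a fraction of traces, score them, and queue low scores for review."""

    judge: Callable[[str, str], float]  # hypothetical: (input, output) -> score in [0, 1]
    sample_rate: float = 0.01           # 1% of production traffic, as in the playbook
    threshold: float = 0.7              # assumed cutoff for human review
    rng: random.Random = field(default_factory=random.Random)
    scores: List[float] = field(default_factory=list)
    review_queue: List[str] = field(default_factory=list)

    def observe(self, trace_id: str, user_input: str, model_output: str) -> None:
        # Sampling: most traces are skipped without ever calling the judge.
        if self.rng.random() >= self.sample_rate:
            return
        # Scoring: the evaluator grades the sampled trace and the score is recorded.
        score = self.judge(user_input, model_output)
        self.scores.append(score)
        # Alerting: a score under the threshold is routed to a human reviewer.
        if score < self.threshold:
            self.review_queue.append(trace_id)
```

Calling `observe` once per production trace keeps judge cost proportional to `sample_rate`, while `review_queue` collects only the trace IDs a person needs to look at.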

The law firm incident in the LLM drift content — model drifted in legal reasoning over 8 weeks, no one noticed — is exactly the failure this catches. Langfuse running LLM-as-judge on production traffic would have flagged the quality drop at week 2, not when a client complained at week 8.
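
One way to picture how a sustained quality drop gets flagged early is to compare each week's mean judge score against a baseline week. The sketch below assumes that approach; `first_drift_week`, `baseline_week`, and `max_drop` are hypothetical names, and the scores are made up for illustration rather than taken from the incident.

```python
from statistics import mean
from typing import Dict, List, Optional


def first_drift_week(
    weekly_scores: Dict[int, List[float]],
    baseline_week: int = 1,
    max_drop: float = 0.05,
) -> Optional[int]:
    """Return the first week whose mean score is more than max_drop below baseline."""
    baseline = mean(weekly_scores[baseline_week])
    for week in sorted(weekly_scores):
        if week <= baseline_week:
            continue
        if baseline - mean(weekly_scores[week]) > max_drop:
            return week
    return None  # no week drifted beyond the tolerance


# Made-up judge scores for illustration only.
scores = {
    1: [0.90, 0.88, 0.92],
    2: [0.80, 0.82, 0.78],
    3: [0.75, 0.70, 0.72],
}
print(first_drift_week(scores))  # prints 2
```

A fixed baseline week is the simplest choice; a rolling baseline would tolerate slow, intentional changes but could also mask gradual drift.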

Deployment Options

Cloud (langfuse.com): Hosted version with free tier. Suitable for development, small teams, and testing. Data stays in Langfuse’s infrastructure — not appropriate for deployments with strict data residency requirements.

Self-hosted (open source): Docker Compose deployment on your own infrastructure. Full data control. Suitable for GDPR/DPDP environments where LLM trace data (which includes real user inputs and outputs) can’t leave your network.

# Self-hosted via Docker Compose
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up

The self-hosted path is the right default for any deployment where compliance matters — which is most SMB deployments in regulated sectors.

Integration

LiteLLM integration (one line):

import litellm

# Requires LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY in the environment
litellm.success_callback = ["langfuse"]

Direct SDK (Python):

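import openai
from langfuse import Langfuse

# Assumes the v2 Python SDK, where trace() returns a client object rather than
# a context manager; newer SDK versions expose a different tracing API.
langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

# Wrap your LLM calls
trace = langfuse.trace(name="customer-support")
response = openai.chat.completions.create(...)
trace.score(name="quality", value=0.9)
langfuse.flush()  # send buffered events before the process exits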

Framework integrations: OpenAI SDK, LangChain, LlamaIndex, and 50+ others via OpenTelemetry-compatible instrumentation.

The Solo Implementer Reality

For a solo implementer managing AI, Langfuse answers the question “is my deployment still working correctly?”, which infrastructure metrics alone cannot answer.

Traditional monitoring: the server is up, response times are normal, no 500 errors.

Langfuse: accuracy on customer intent classification dropped from 89% to 71% this week. Cost per session jumped 40% after the last prompt change. Three trace IDs produced outputs below your quality threshold.

The second list is what running an AI deployment actually requires.
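
The numbers in that second view are aggregates over trace records. A minimal sketch of the arithmetic follows, assuming each trace has been exported as a dictionary; the `TraceRecord` fields and the function names are hypothetical and do not reflect Langfuse's export schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, TypedDict


class TraceRecord(TypedDict):
    trace_id: str
    session_id: str
    cost_usd: float
    quality: float  # judge score in [0, 1]


def cost_per_session(traces: Iterable[TraceRecord]) -> float:
    """Average total cost per session across the given traces."""
    totals: Dict[str, float] = defaultdict(float)
    for t in traces:
        totals[t["session_id"]] += t["cost_usd"]
    return sum(totals.values()) / len(totals) if totals else 0.0


def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


def below_threshold(traces: Iterable[TraceRecord], threshold: float) -> List[str]:
    """Trace IDs whose quality score is under the threshold."""
    return [t["trace_id"] for t in traces if t["quality"] < threshold]
```

Comparing `cost_per_session` for two consecutive weeks with `percent_change` gives the kind of "cost per session jumped 40%" figure described above, and `below_threshold` yields the trace IDs to inspect.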

Where It Breaks

Trace volume costs at scale. Logging every LLM call adds latency (minimal — async by default) and storage costs. At 1M+ traces/month, storage starts mattering. The practical fix: sample at 10-20% for production monitoring, 100% for development and testing.
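
A sampling decision like this can be made deterministic by hashing the trace ID, so every event belonging to one trace is kept or dropped together. The sketch below is one possible approach, not a Langfuse feature: `SAMPLE_RATES`, `should_log`, and the 15% production rate are assumptions chosen from the 10-20% range above.

```python
import hashlib

# Assumed rates: sample production, keep everything elsewhere.
SAMPLE_RATES = {"production": 0.15, "development": 1.0, "testing": 1.0}


def should_log(trace_id: str, environment: str) -> bool:
    """Decide whether to keep a trace, consistently for a given trace ID."""
    rate = SAMPLE_RATES.get(environment, 1.0)  # unknown environments log everything
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Hash-based sampling avoids shared state between workers: any process that sees the same trace ID reaches the same decision.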

Evaluation latency. LLM-as-judge evaluations run asynchronously but still take seconds per trace. For real-time quality gates (reject bad responses before the user sees them), Langfuse is post-hoc — not a synchronous guardrail.
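
The difference between post-hoc scoring and a synchronous guardrail is easiest to see side by side. This is a generic sketch of the two patterns with hypothetical names (`respond_post_hoc`, `respond_gated`, `judge`, `fallback`); it does not call Langfuse.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Tuple

_executor = ThreadPoolExecutor(max_workers=2)


def respond_post_hoc(answer: str, judge: Callable[[str], float]) -> Tuple[str, Future]:
    """Return the answer immediately; the score arrives later in the background."""
    pending = _executor.submit(judge, answer)
    return answer, pending


def respond_gated(
    answer: str,
    judge: Callable[[str], float],
    threshold: float = 0.7,
    fallback: str = "Sorry, I can't answer that reliably.",
) -> str:
    """Block on the judge before replying; every request pays the judge's latency."""
    return answer if judge(answer) >= threshold else fallback
```

In the post-hoc pattern the user has already seen the answer by the time `pending.result()` is available, which is why it suits monitoring and review rather than blocking bad responses.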

Data in traces is sensitive. Your LLM traces contain real user inputs and outputs. Self-hosting isn’t optional if those inputs include PII, health data, or financial details.

When to Choose It

You have any AI feature in production and no visibility into output quality
You run self-hosted models (Ollama/vLLM) and need to catch drift before users do
Multiple team members modify prompts and you need to compare versions empirically
Compliance requires on-premise trace storage

Default: Add Langfuse before going live, not after your first quality incident.

Related

LLM Drift — The degradation Langfuse detects
Silent Agent Failure — The failure mode Langfuse surfaces
Cost Overrun — Cost tracking prevents billing surprises
LiteLLM — Routes LLM traffic to Langfuse in one callback line
Operations & Maintenance — Where monitoring lives in your stack
Stack & Tools — Platform profiles

WyrdWerk Deployment Wiki
