Langfuse vs LangSmith: Two competing AI observability platforms compared

Two approaches to agent observability: open-source flexibility vs managed ecosystem depth

A router agent classifies the intent
A research sub-agent calls three different tools
Each tool call generates its own LLM interaction
The results flow back through a synthesis step
A writer agent composes the final response

Agent trace tree showing hierarchical operations

Hierarchical traces that show parent-child relationships between agent steps
Token and cost tracking at every node in the tree
Latency attribution so you know which tool call is your bottleneck
Evaluation hooks to score agent behavior at each decision point
Session tracking to connect multi-turn conversations

Langfuse:
Python SDK v3 with @observe() decorator
OpenTelemetry-native spans and generations
80+ integrations (see Langfuse integrations page for current count): LangChain, LangGraph, OpenAI, Claude, LlamaIndex, Vercel AI SDK, and custom agents
Self-hosting with full feature parity using Docker
ClickHouse backend with raw SQL access for custom analytics
Unit-based pricing with a free Hobby tier and a $29/month Core tier, both with unlimited users; higher Pro/Enterprise tiers exist for larger workloads

LangSmith:
LangSmith Python SDK with @traceable decorator
Run Trees with hierarchical run/span structure
Deepest LangChain and LangGraph support: native integration, LangGraph Studio IDE, Agent Builder
Native alerting: threshold-based monitoring with webhooks
Annotation Queues for human-in-the-loop evaluation
Per-seat pricing: $39/seat/month on the Plus tier
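
To make the observability requirements above concrete, here is a minimal, library-free sketch of a trace tree for the agent flow described earlier (router, research sub-agent with three tools, synthesis, writer). The `Span` class, its field names, the tool names, and all numbers are illustrative assumptions, not Langfuse or LangSmith data structures. The sketch only shows how token totals roll up the hierarchy and how latency attribution singles out a bottleneck.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Span:
    """One node in a hypothetical trace tree (not a real SDK type)."""
    name: str
    tokens: int = 0        # tokens used by this node itself
    latency_ms: int = 0    # time spent in this node itself
    children: list["Span"] = field(default_factory=list)

    def total_tokens(self) -> int:
        """Tokens for this node plus everything beneath it."""
        return self.tokens + sum(c.total_tokens() for c in self.children)

    def walk(self) -> Iterator["Span"]:
        """Yield this span and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def slowest_leaf(root: Span) -> Span:
    """Return the leaf span with the highest own latency."""
    leaves = [s for s in root.walk() if not s.children]
    return max(leaves, key=lambda s: s.latency_ms)


# Made-up numbers for the five-step flow described above.
research = Span("research", children=[
    Span("web_search", tokens=800, latency_ms=1200),
    Span("db_lookup", tokens=400, latency_ms=2500),
    Span("calculator", tokens=150, latency_ms=90),
])
trace = Span("agent-run", children=[
    Span("router", tokens=120, latency_ms=300),
    research,
    Span("synthesis", tokens=900, latency_ms=1100),
    Span("writer", tokens=700, latency_ms=950),
])

print(trace.total_tokens())       # tokens across the whole tree
print(research.total_tokens())    # tokens attributable to the research sub-agent
print(slowest_leaf(trace).name)   # the step to optimize first
```

A flat request-response log would report only the root totals; the tree is what lets you attribute tokens and latency to an individual tool call.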

from langfuse import observe, get_client, propagate_attributes
from langfuse.langchain import CallbackHandler
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# WHAT: @observe() is Langfuse's core instrumentation decorator.
# WHY: It automatically captures function inputs, outputs, execution time,
#      and any exceptions -- without modifying your function's logic.
# WHEN: Apply to every function that represents a meaningful agent step.
#       Think of it as "anything you'd want to see in a trace."
@observe()
def classify_intent(query: str) -> str:
    """Classify user intent using LLM."""
    # WHAT: CallbackHandler bridges LangChain's internal event system to Langfuse.
    # WHY: LangChain does not expose LLM calls directly. The callback handler
    #      intercepts LLM start/end events and creates child spans automatically.
    # WHEN: Create a new instance per chain invocation to avoid context leakage
    #       between concurrent requests.
    langfuse_handler = CallbackHandler()
    llm = ChatOpenAI(model_name="gpt-5.4")
    prompt = ChatPromptTemplate.from_template(
        "Classify this query as 'research', 'action', or 'chat': {query}"
    )
    chain = prompt | llm
    # Passing the handler via config keeps instrumentation separate from logic.
    # The chain runs normally; the handler observes silently.
    result = chain.invoke(
        {"query": query},
        config={"callbacks": [langfuse_handler]}
    )
    return result.content


@observe()
def retrieve_context(query: str) -> str:
    """Retrieve relevant context using tools."""
    # WHAT: Tool calls inside an @observe()-decorated function are automatically
    #       linked as child spans in the trace tree.
    # WHY: OpenTelemetry context propagation handles the parent-child relationship.
    #      You do not need to pass trace IDs manually.
    return search_database(query)


@observe()
def generate_response(query: str, context: str) -> str:
    ...

import langsmith as ls
from langsmith import traceable
from langsmith.wrappers import wrap_openai
import openai

# WHAT: wrap_openai() monkey-patches the OpenAI client to intercept all API calls.
# WHY: Instead of passing a callback to every chain, you instrument the client once.
#      Every subsequent openai call is automatically captured as a trace span.
# WHEN: Call this once at module initialization, before any LLM calls are made.
#       This is the "wrap once, trace everywhere" pattern.
client = wrap_openai(openai.Client())

# WHAT: @traceable declares this function as a traceable unit in a Run Tree.
# WHY: run_type provides semantic meaning to the dashboard. "chain" means a
#      sequence of steps; "tool" means a discrete capability invocation.
#      This lets you filter and analyze by operation type.
# WHEN: Use run_type="chain" for multi-step sequences and run_type="tool"
#       for discrete capability invocations like database queries or API calls.
@traceable(run_type="chain", name="Classify Intent")
def classify_intent(query: str) -> str:
    """Classify user intent using LLM."""
    # The wrapped client automatically links this LLM call as a child of
    # the current @traceable context. No callback configuration required.
    response = client.chat.completions.create(
        model="gpt-5.4",
        messages=[{
            "role": "user",
            "content": f"Classify as 'research', 'action', or 'chat': {query}"
        }]
    )
    return response.choices[0].message.content


# run_type="tool" signals to LangSmith that this is a discrete capability,
# not a reasoning step. It affects how the run is displayed and aggregated.
@traceable(run_type="tool", name="Retrieve Context")
def retrieve_context(query: str) -> str:
    """Retrieve relevant context using tools."""
    return search_database(query)


@traceable(run_type="chain", name="Generate Response")
def generate_response(query: str, context: str) -> str:
    """Generate final response."""

Langfuse (@observe() decorator):
Wraps a function as an “observation” and records inputs, outputs, timing, and exceptions.
Best used to mark meaningful agent steps, such as routing, retrieval, and synthesis.
Plays nicely with OpenTelemetry-style nesting, so a parent function naturally contains child spans.

LangSmith (@traceable decorator):
Wraps a function as a “run” inside a Run Tree.
Adds richer labeling controls, such as a human-friendly name and semantic run_type.
Works especially well when you want dashboards and evaluations to group steps by type.

Langfuse (CallbackHandler):
You pass a Langfuse callback into each LangChain invocation so it can listen to LangChain’s internal events.
More explicit and granular, which is useful when you only want to trace certain chains, or you need to separate contexts for concurrency.
Slightly more boilerplate, since each invocation point must be configured correctly.

LangSmith (wrap_openai()):
Instruments the OpenAI client once so that any subsequent calls are captured automatically.
Very ergonomic for projects that directly use the OpenAI SDK, and for teams that want “set it and forget it” instrumentation.
Global wrapping can be harder to reason about if you need fine-grained control over what is traced.

Langfuse (inferred span types):
The platform infers structure from spans and LLM “generation” events.
Less configuration, but also fewer explicit semantics when you want to distinguish tools vs reasoning steps.

LangSmith (explicit run_type):
You label each unit of work as chain, tool, llm, and so on.
Improves filtering, aggregation, and evaluation workflows because the UI can treat different run types differently.

Langfuse (OpenTelemetry context propagation):
Uses industry-standard context propagation patterns.
Easier to integrate with existing observability stacks and standards-based tooling.
Typically offers better interoperability across languages and services.

LangSmith (run context model):
Uses LangSmith’s own run context model.
Often “just works” inside the LangChain and LangGraph ecosystem.
Interoperability outside that ecosystem may require additional adapters or exports.

Langfuse (explicit session identifiers):
You explicitly attach identifiers, such as user ID, conversation ID, or tenant ID, to connect traces across turns.
This makes the linking logic obvious and portable, but it does require discipline to set consistently.

LangSmith (environment-level configuration):
Many projects enable tracing and session behavior through environment-level configuration.
Great for retrofitting tracing without touching much code, especially in LangGraph.
You still need a clear convention for conversation identifiers so multi-turn threads are reliably connected.
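
The trade-off between per-invocation callbacks and wrap-once instrumentation can be illustrated without either SDK. The sketch below is a simplified stand-in: `FakeLLMClient`, `wrap_client`, and `complete_with_callback` are hypothetical names invented for this example, and neither Langfuse nor LangSmith is implemented this way internally. It only contrasts the two call-site shapes.

```python
import functools
import time


class FakeLLMClient:
    """Hypothetical stand-in for an LLM SDK client."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def wrap_client(client: FakeLLMClient, sink: list) -> FakeLLMClient:
    """Wrap-once style: replace the method so every later call is recorded."""
    original = client.complete

    @functools.wraps(original)
    def traced(prompt: str) -> str:
        start = time.perf_counter()
        output = original(prompt)
        sink.append({
            "input": prompt,
            "output": output,
            "seconds": time.perf_counter() - start,
        })
        return output

    client.complete = traced
    return client


def complete_with_callback(client: FakeLLMClient, prompt: str, on_end=None) -> str:
    """Callback style: tracing happens only when the caller passes a handler."""
    output = client.complete(prompt)
    if on_end is not None:
        on_end(prompt, output)
    return output


# Wrap once: both calls are captured with no per-call configuration.
calls: list = []
wrapped = wrap_client(FakeLLMClient(), calls)
wrapped.complete("hello")
wrapped.complete("world")

# Callback: only the invocation that passes a handler is captured.
seen: list = []
plain = FakeLLMClient()
complete_with_callback(plain, "traced", on_end=lambda p, o: seen.append((p, o)))
complete_with_callback(plain, "untraced")

print(len(calls), len(seen))
```

The wrapped client cannot forget to trace a call, while the callback version makes it easy to trace selectively. That is the same tension described in the comparison above.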

from langfuse.langchain import CallbackHandler
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from typing import TypedDict, Annotated
import operator

# WHAT: TypedDict defines the shape of state flowing through the graph.
# WHY: LangGraph passes this state dictionary between nodes. Explicit typing
#      prevents silent data loss when a node forgets to return a field.
# WHEN: Always type your state. Untyped state is the leading cause of
#       hard-to-debug graph execution errors.
class AgentState(TypedDict):
    messages: Annotated[list, operator.add]  # operator.add means messages accumulate
    research_results: str
    final_answer: str


def research_node(state: AgentState) -> AgentState:
    """Research sub-agent with tool calls."""
    llm = ChatOpenAI(model="gpt-5.4")
    # WHAT: The LLM call here will be captured as a nested span under this node's span.
    # WHY: Langfuse's CallbackHandler (passed at invocation time) intercepts
    #      LangChain's callback events and creates child spans automatically.
    result = llm.invoke(state["messages"])
    return {"messages": [result], "research_results": result.content}


def synthesis_node(state: AgentState) -> AgentState:
    """Synthesize research into final answer."""
    llm = ChatOpenAI(model="gpt-5.4")
    result = llm.invoke([
        HumanMessage(content=f"Synthesize: {state['research_results']}")
    ])
    return {"messages": [result], "final_answer": result.content}


# Build the graph
graph = StateGraph(AgentState)
graph.add_node("research", research_node)
graph.add_node("synthesis", synthesis_node)
graph.set_entry_point("research")
graph.add_edge("research", "synthesis")
graph.add_edge("synthesis", END)
app = graph.compile()

# WHAT: Passing CallbackHandler in the config dict enables Langfuse tracing
#       for the entire graph execution, including all nested LLM calls.
# WHY: LangGraph propagates the config (and t

# WHAT: LangSmith traces LangGraph automatically via environment variables.
# WHY: LangSmith is built by the same team as LangChain and LangGraph.
#      The integration is baked into LangGraph's core execution loop.
# WHEN: Set these variables once in your environment or .env file.
#       No code changes needed to enable full tracing.
#
# LANGSMITH_TRACING=true
# LANGSMITH_API_KEY=your-key
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

# Same graph definition as above...
app = graph.compile()

# WHAT: With environment variables set, this invocation is fully traced.
# WHY: LangGraph calls LangSmith's tracing hooks at the framework level,
#      not the application level. The graph, nodes, edges, and LLM calls
#      all appear in LangSmith without any explicit instrumentation.
# WHEN: This zero-config approach works well for teams building exclusively
#       with LangGraph. No per-request setup, no callback management.
result = app.invoke(
    {"messages": [HumanMessage(content="Explain quantum computing")],
     "research_results": "", "final_answer": ""}
)
# No callback handler needed - tracing is automatic

LangSmith’s advantage here is zero-friction setup.
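
Because this setup depends entirely on environment variables, a missing or misspelled variable fails silently: the graph runs, but no traces appear. A small preflight check can catch that early. The sketch below is an assumption-laden helper, not part of the LangSmith SDK; `tracing_config_problems` is a hypothetical name, and the only values taken from the example above are the two variable names.

```python
import os
from typing import Mapping, Optional


def tracing_config_problems(env: Optional[Mapping[str, str]] = None) -> list:
    """Return human-readable problems with the tracing environment, if any."""
    env = os.environ if env is None else env
    problems = []
    if env.get("LANGSMITH_TRACING", "").lower() != "true":
        problems.append("LANGSMITH_TRACING is not set to 'true'")
    if not env.get("LANGSMITH_API_KEY"):
        problems.append("LANGSMITH_API_KEY is missing")
    return problems


# Check the real environment at startup and warn instead of failing hard.
for problem in tracing_config_problems():
    print(f"tracing disabled or misconfigured: {problem}")
```

Passing a plain dict instead of `os.environ` makes the check easy to unit test.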

from langchain_core.messages import HumanMessage
from langchain_core.tools import tool
from langfuse import Langfuse
from langfuse.langchain import CallbackHandler

langfuse = Langfuse()

# WHAT: start_as_current_observation() creates a new span that is explicitly
#       linked to a parent trace via trace_context.
# WHY: When a sub-agent runs in a separate function or thread, OpenTelemetry's
#      automatic context propagation may not carry over. This explicit linking
#      ensures the sub-agent's work appears nested under the parent trace.
# WHEN: Use this pattern when sub-agents are invoked as tools (decorated with
#       @tool), run in separate threads, or execute in a different async context.
# NOTE: parent_trace_id and sub_graph are assumed to be defined elsewhere.
@tool
def research_sub_agent(question: str) -> str:
    """Sub-agent that maintains trace context."""
    with langfuse.start_as_current_observation(
        name="research-sub-agent",
        trace_context={"trace_id": parent_trace_id}  # Explicitly link to parent
    ) as span:
        span.update(input=question)   # Log what the sub-agent received
        result = sub_graph.invoke(
            {"messages": [HumanMessage(content=question)]},
            config={"callbacks": [CallbackHandler()]}
        )
        span.update(output=result["messages"][-1].content)  # Log what it produced
    return result["messages"][-1].content
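
The reason explicit linking matters is easier to see in a stripped-down form. The sketch below uses only the standard library and an in-memory list in place of a tracing backend; `record_span`, `sub_agent`, and `orchestrator` are hypothetical names, and the span dictionaries are not Langfuse's data model. It shows one way a trace ID can be handed to a worker thread by argument rather than relying on ambient context.

```python
import threading
import uuid

spans: list = []
_lock = threading.Lock()


def record_span(name: str, trace_id: str, parent=None) -> None:
    """Append a span record to the in-memory store (stand-in for a backend)."""
    with _lock:
        spans.append({"name": name, "trace_id": trace_id, "parent": parent})


def sub_agent(question: str, trace_id: str) -> None:
    """Runs in a separate thread; the trace ID arrives as an argument."""
    record_span("research-sub-agent", trace_id, parent="orchestrator")


def orchestrator(question: str) -> str:
    """Start a trace, then hand its ID to the worker thread explicitly."""
    trace_id = uuid.uuid4().hex  # 32 hex characters
    record_span("orchestrator", trace_id)
    worker = threading.Thread(target=sub_agent, args=(question, trace_id))
    worker.start()
    worker.join()
    return trace_id


trace_id = orchestrator("Explain quantum computing")
print([s["name"] for s in spans if s["trace_id"] == trace_id])
```

Passing the ID as data is more verbose than ambient context, but it keeps working across threads, queues, and process boundaries.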

# WHAT: Nesting @traceable decorators creates automatic parent-child relationships
#       in the Run Tree without any manual trace ID management.
# WHY: LangSmith uses Python's context variable mechanism to track the current
#      active run. When a @traceable function calls another @traceable function,
#      the inner function automatically becomes a child run.
# WHEN: This works seamlessly for synchronous call chains. For async or
#       thread-based sub-agents, you may need to pass run_id explicitly.
@traceable(name="Research Sub-Agent")
def research_sub_agent(question: str) -> str:
    """Sub-agent - trace context propagates automatically."""
    result = sub_graph.invoke(
        {"messages": [HumanMessage(content=question)]}
    )
    return result["messages"][-1].content

Annotation Queues: Human reviewers score traces through a dedicated UI workflow
Pairwise Comparison: A/B test two model configurations side-by-side
LLM-as-Judge: Automated scoring with rubric instructions and strict structured output
Heuristic Evaluators: Pre-built templates for common checks (contains, regex, length)
Online Evaluation: Real-time scoring of production traces

from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# WHAT: create_dataset() builds a reusable set of input-output pairs for evaluation.
# WHY: Evaluations against a fixed dataset let you measure the effect of model
#      changes, prompt changes, or architecture changes on the same inputs.
#      Without a dataset, you are comparing apples to oranges.
# WHEN: Create datasets for your most critical use cases. Start with 20-50
#       representative examples covering edge cases and common patterns.
dataset = client.create_dataset("agent-eval-set")

# WHAT: evaluate() runs your agent against the dataset and scores each run.
# WHY: experiment_prefix groups related evaluation runs so you can compare
#      "agent-v1" vs "agent-v2" side-by-side in the LangSmith UI.
# WHEN: Run evaluations before deploying any significant change to prompts,
#       models, or agent architecture.
results = evaluate(
    agent_pipeline,
    data=dataset,
    evaluators=["correctness", "helpfulness"],
    experiment_prefix="agent-v2"
)

LangSmith provides a batteries-included evaluation framework.
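
Stripped of the platform, a dataset evaluation is a loop: run the pipeline on each example, apply every evaluator, and average the scores. The sketch below illustrates that loop with the standard library only. It is not the LangSmith `evaluate()` implementation; the tiny dataset, the stand-in `agent_pipeline`, and the evaluator functions are all made up for illustration.

```python
from statistics import mean

# A fixed dataset of input/expected pairs (invented examples).
dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "largest planet", "expected": "Jupiter"},
    {"input": "author of Hamlet", "expected": "Shakespeare"},
]


def agent_pipeline(question: str) -> str:
    """Hypothetical agent: a lookup table standing in for real LLM calls."""
    answers = {
        "2 + 2": "4",
        "capital of France": "Paris",
        "largest planet": "Jupiter",
    }
    return answers.get(question, "I don't know")


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the answer matches the reference, ignoring case."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def non_empty(output: str, expected: str) -> float:
    """Score 1.0 when the agent produced any answer at all."""
    return 1.0 if output.strip() else 0.0


def run_experiment(pipeline, examples, evaluators) -> dict:
    """Run the pipeline on every example and average each evaluator's scores."""
    scores = {name: [] for name in evaluators}
    for example in examples:
        output = pipeline(example["input"])
        for name, evaluator in evaluators.items():
            scores[name].append(evaluator(output, example["expected"]))
    return {name: mean(values) for name, values in scores.items()}


summary = run_experiment(
    agent_pipeline,
    dataset,
    {"exact_match": exact_match, "non_empty": non_empty},
)
print(summary)
```

Keeping the dataset fixed is what makes two runs comparable: rerun `run_experiment` after a prompt or model change and compare the averages per evaluator.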

LLM-as-Judge: supports categorical scores running on your infrastructure
Ragas Integration: RAG-specific metrics (retrieval relevance, answer faithfulness, context completeness)
Custom Scorers: Full SDK access for building custom evaluation logic
Versioned Dataset Experiments: Reproducible evaluations with timestamp-based versioning
Evaluator Tracing: Debug your evaluators by tracing their own LLM calls

from langfuse import Langfuse

langfuse = Langfuse()

# WHAT: langfuse.create_score() attaches a numeric or categorical score to a specific trace.
#       (In Python SDK v3 this replaces the older langfuse.score() method.)
# WHY: Scores connect evaluation results back to the exact trace that produced them.
#      You can filter traces by score to find the worst-performing runs for debugging.
# WHEN: Use programmatic scoring for automated quality checks that run in production,
#       such as validating that responses meet minimum length or format requirements.
langfuse.create_score(
    trace_id="trace-abc-123",
    name="correctness",
    value=0.85,
    comment="Answer was mostly correct but missed edge case"
)

# WHAT: CATEGORICAL data_type enables string-valued scores instead of numeric ones.
# WHY: Some evaluation criteria are categorical by nature. "Did this hallucinate?"
#      has three meaningful answers: yes, no, or uncertain. Forcing that into a
#      0-1 scale loses information.
# WHEN: Use CATEGORICAL for binary or multi-class checks like hallucination detection,
#       tone classification, or policy compliance.
langfuse.create_score(
    trace_id="trace-abc-123",
    name="hallucination-check",
    value="no_hallucination",  # Categorical
    data_type="CATEGORICAL"
)
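
Once scores are attached to traces, the useful question is which traces to open first. The sketch below shows that filtering step on plain dictionaries. The score records, trace IDs, and helper names are invented for illustration and do not reflect the Langfuse API or storage schema.

```python
# Invented score records: numeric and categorical scores keyed by trace.
scores = [
    {"trace_id": "t1", "name": "correctness", "value": 0.85},
    {"trace_id": "t2", "name": "correctness", "value": 0.40},
    {"trace_id": "t3", "name": "correctness", "value": 0.95},
    {"trace_id": "t1", "name": "hallucination-check", "value": "no_hallucination"},
    {"trace_id": "t2", "name": "hallucination-check", "value": "hallucination"},
    {"trace_id": "t3", "name": "hallucination-check", "value": "uncertain"},
]


def worst_traces(records: list, name: str, threshold: float) -> list:
    """Trace IDs whose numeric score is below the threshold, worst first."""
    numeric = [
        r for r in records
        if r["name"] == name
        and isinstance(r["value"], (int, float))
        and r["value"] < threshold
    ]
    return [r["trace_id"] for r in sorted(numeric, key=lambda r: r["value"])]


def traces_with_label(records: list, name: str, label: str) -> list:
    """Trace IDs whose categorical score equals the given label."""
    return [
        r["trace_id"] for r in records
        if r["name"] == name and r["value"] == label
    ]


print(worst_traces(scores, "correctness", 0.9))
print(traces_with_label(scores, "hallucination-check", "hallucination"))
```

Note how the categorical score keeps "uncertain" distinct from both other outcomes, which a 0-1 scale would blur.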

Human annotation:
Langfuse: Annotation Queues with structured human review flows, open-sourced in mid-2025. Supports coordinated reviewer workflows and review throughput management.
LangSmith: Annotation Queues with a richer reviewer experience. Designed for structured, repeatable human-in-the-loop review at scale.

LLM-as-judge:
Langfuse: Yes, plus evaluator tracing. You can trace and debug the evaluator itself, which helps when judge outputs look inconsistent or surprising.
LangSmith: Yes, with rubric instructions and structured scoring patterns. Great when you want a standardized evaluation setup with clear criteria.

Pairwise comparison:
Langfuse: No first-class pairwise UI today. You can still run A/B tests, but you typically build the comparison workflow yourself.
LangSmith: Yes. Built-in pairwise evaluation makes it easier to compare prompts, models, and agent versions in a consistent workflow.

RAG metrics:
Langfuse: Ragas integration, which is useful for retrieval relevance, faithfulness, and context completeness metrics without reinventing common RAG evaluation logic.
LangSmith: Manual setup. You can absolutely evaluate RAG quality, but you will likely wire up metrics and evaluators more explicitly.

Online evaluation:
Langfuse: Via API. Flexible and programmable, but you own more of the “production scoring pipeline” implementation details.
LangSmith: Native real-time evaluation workflows. Easier to operationalize ongoing scoring and monitoring.

Custom scorers:
Langfuse: Full SDK access. Best when you want to implement bespoke evaluators, custom scoring schemas, and deeper integration with internal systems.
LangSmith: Template-based plus custom options. Fast to get started with common evaluators, with customization available when needed.

Dataset experiments:
Langfuse: Yes, including versioned dataset experiments. Useful when you want reproducibility tied to dataset snapshots over time.
LangSmith: Yes. Strong support for running repeatable experiments on a stable dataset.

Langfuse self-hosting:
Docker Compose deployment for small teams
Kubernetes with Helm charts for production
ClickHouse backend with raw SQL access for custom analytics
Air-gapped deployments with no vendor contact required
Open-source core is fully feature-complete; Enterprise tier adds license-keyed features (SSO, advanced RBAC, dedicated support, etc.)

LangSmith self-hosting:
Requires a sales contract and license key
Helm chart deployment with RDS/K8s support (see current LangSmith docs for chart version)
Dynatrace integration for monitoring
No self-hosting on free or Plus tiers

Langfuse Cloud:
Monthly cost: Free on the Hobby tier; $29/month on the Core tier. A 10-person team at 100K traces/month most likely lands on Pro (check the current pricing page for exact tier thresholds).
Annual cost: $348 on the Core tier ($29 * 12); higher on the Pro tier (check current Langfuse pricing for the tier that fits 100K traces/month). Spend is volume-based, not headcount-based.
Per-trace cost: Unit-based. Usage drives cost more than headcount, which can be favorable when your team grows faster than your traffic.
Scaling cost: Low (units). Adding engineers does not automatically increase the bill, but higher trace volume will.

LangSmith Plus:
Monthly cost: $390/month (10 seats at $39 each). This is per-seat pricing, so it grows directly with team size.
Annual cost: $4,680 for seats alone ($390 * 12). Add per-1,000-trace overage charges for any traces above the included allowance; for a 100K-trace scenario, check the current LangSmith pricing page for the total effective cost.
Per-trace cost: LangSmith Plus bills both per seat and per trace. Overages apply above the included limit, so 100K traces add per-1,000-trace charges on top of the seat cost (verify current overage rates in the LangSmith docs).
Scaling cost: High (add seats). Every additional teammate increases recurring cost, even if traffic is flat.

Langfuse self-hosted:
Monthly cost: $0 (infra costs only). There is no SaaS subscription, but you pay in cloud compute, storage, and operational time.
Annual cost: ~$1,200–2,400 (infra). This is the estimated hosting range and will vary based on retention, throughput, and reliability requirements.
Per-trace cost: $0 (platform). The platform does not charge per trace, but your infrastructure effectively becomes the cost driver.
Scaling cost: Fixed (infra). Costs tend to step up in chunks as you outgrow a deployment size, rather than increasing with each hire.
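
The seat arithmetic above is simple enough to check in a few lines. The sketch below uses only the list prices quoted in this article ($39 per seat per month for LangSmith Plus, $29 per month for Langfuse Core). It deliberately ignores trace overages, usage-based units, and tier thresholds, which need to be taken from each vendor's current pricing page; the class and constant names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeatPlan:
    """Subscription billed per seat per month (usage charges not modeled)."""
    price_per_seat_month: int

    def annual_cost(self, seats: int) -> int:
        return self.price_per_seat_month * seats * 12


@dataclass(frozen=True)
class FlatPlan:
    """Subscription with a flat monthly base fee (usage charges not modeled)."""
    price_per_month: int

    def annual_cost(self, seats: int) -> int:
        # Headcount does not change the base fee.
        return self.price_per_month * 12


LANGSMITH_PLUS = SeatPlan(price_per_seat_month=39)
LANGFUSE_CORE = FlatPlan(price_per_month=29)

for seats in (4, 10, 25):
    plus = LANGSMITH_PLUS.annual_cost(seats)
    core = LANGFUSE_CORE.annual_cost(seats)
    print(f"{seats:>2} seats: Plus ${plus:,}/yr, Core ${core:,}/yr, difference ${plus - core:,}")
```

The difference column grows linearly with headcount. Treat these figures as lower bounds on real spend for both platforms.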

LangSmith alerting and monitoring:
Threshold-based alerts for cost, latency, token usage, and error rates
Webhook integrations for Slack, PagerDuty, and custom endpoints
“Threads” feature for clustering similar conversations and identifying systemic issues
Configurable monitoring dashboards

Langfuse alerting and monitoring:
Metrics API for programmatic access to trace data
OpenTelemetry export to Datadog, Grafana, or your existing monitoring stack
ClickHouse SQL for custom alert queries (self-hosted)
Webhook integrations for custom pipelines
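
If you build alerting yourself on top of exported metrics, the core logic is a threshold comparison followed by a webhook payload. The sketch below shows that logic with made-up metric names and limits. It does not call any Langfuse or LangSmith API and does not send anything over the network; the payload shape is an assumption to adapt to whatever your webhook receiver expects.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float


def check_thresholds(metrics: dict, thresholds: list) -> list:
    """Return one alert record per metric that exceeds its limit."""
    alerts = []
    for threshold in thresholds:
        value = metrics.get(threshold.metric)
        if value is not None and value > threshold.limit:
            alerts.append({
                "metric": threshold.metric,
                "value": value,
                "limit": threshold.limit,
            })
    return alerts


def webhook_payload(alerts: list) -> str:
    """Serialize alerts into a JSON body for a generic webhook receiver."""
    lines = [f"{a['metric']}={a['value']} (limit {a['limit']})" for a in alerts]
    return json.dumps({"text": "Agent alert: " + "; ".join(lines)})


# Hypothetical metrics pulled from your tracing backend for the last hour.
metrics = {"p95_latency_ms": 4200, "error_rate": 0.01, "daily_cost_usd": 35.0}
thresholds = [
    Threshold("p95_latency_ms", 3000),
    Threshold("error_rate", 0.05),
    Threshold("daily_cost_usd", 50.0),
]

alerts = check_thresholds(metrics, thresholds)
if alerts:
    print(webhook_payload(alerts))
```

With native alerting, this loop and its scheduling are handled by the platform; with the export-based route, you own both.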

Decision framework: choosing between Langfuse and LangSmith

Only LangChain/LangGraph? LangSmith deserves serious consideration.
Mixed stack (OpenAI SDK, Claude, LlamaIndex, custom agents)? Langfuse is the stronger choice.
Healthcare, finance, or government with strict data residency rules? Langfuse self-hosted is often the only compliant option without an enterprise contract.
No strict requirements? Both work.
Under 5 engineers? LangSmith’s per-seat cost is manageable (under $195/month at $39 per seat).
10+ engineers? Langfuse’s pricing does not depend on headcount, which can save thousands per year.
Already running Kubernetes and Grafana? Langfuse self-hosted adds minimal overhead.
No infrastructure team? LangSmith’s managed SaaS removes the operational burden.

Choose Langfuse if:
You use multiple frameworks (not just LangChain)
Self-hosting and data sovereignty are requirements
Your team is larger than 5 engineers (cost savings compound)
You want raw SQL access to trace data
You prefer open-source with no vendor lock-in
You already run Grafana or Datadog for monitoring

Choose LangSmith if:
You build exclusively with LangChain and LangGraph
You want LangGraph Studio IDE for visual debugging
You need native alerting without additional infrastructure
You prefer managed SaaS with zero operational overhead
Human-in-the-loop evaluation (Annotation Queues) is critical
Your team is small (under 5 engineers) and cost per seat is acceptable

Use both if:
You are evaluating frameworks and want to test before committing
You need different tools for different projects (Langfuse for custom agents, LangSmith for LangGraph)

Key takeaways:
Agent observability requires hierarchical traces, not just request-response logs.
Langfuse uses @observe() + CallbackHandler with OpenTelemetry context propagation.
LangSmith uses @traceable + wrap_openai() with automatic Run Tree nesting.
LangSmith traces LangGraph automatically with zero code changes. Langfuse requires a callback handler at each invocation point.
Sub-agent tracing is automatic in LangSmith and explicit-but-controllable in Langfuse.
Langfuse self-hosting is free with a fully open-source core; Enterprise adds license-keyed features. LangSmith self-hosting requires an Enterprise contract.
For a 10-person team, Langfuse Cloud Core costs $348/year (the Pro tier costs more; check current pricing). LangSmith Plus seat costs alone are $4,680/year, plus per-trace overages.
LangSmith has richer built-in evaluation and alerting. Langfuse integrates more flexibly with existing infrastructure.

Instrument one existing agent today. Pick the simpler platform for your stack (LangSmith if you use LangChain, Langfuse otherwise) and add tracing to one function. The first trace is the hardest. After that, the pattern is clear.
Try both free tiers before your team commits. The instrumentation experience and dashboard quality matter more than feature lists.
For self-hosted Langfuse, run docker compose up with the official Langfuse compose file. You can have a local instance running in under 30 minutes.
For LangSmith, start with the environment variable approach and wrap_openai(). It is the fastest path from zero to your first trace.
Build an evaluation dataset. Tracing tells you what happened. Scoring tells you whether it was good. Start with 20 representative test cases and run evaluations before every significant model or prompt change.