In any production system, observability platforms let you answer three fundamental questions: "What's happening right now?", "Why is it happening?", and "How can I fix it?" For LLM-powered applications, that means more than just counting requests or tracking latency; it means capturing every prompt, every API call, and every response (with context) so you can drill into exactly how your models behave in the real world.
Langfuse delivers these core observability capabilities for LLM applications, along with features such as prompt versioning and security tooling, so your entire development workflow lives in one place. This helps you debug, analyze, and iterate on your LLM applications.
Setup
Langfuse offers both a cloud solution with free and paid plans and an open-source version that can be self-hosted. The easiest way to get started is to run Langfuse locally: clone the project repository and launch it with Docker Compose.
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up
After that, the web interface will be available at http://localhost:3000. Langfuse supports a wide range of options for detailed configuration of the self-hosted instance. Whether you choose the cloud or the self-hosted solution, you will need to register a user, create an organization, create a project, and obtain API keys.
Then, set the obtained keys as environment variables:
LANGFUSE_SECRET_KEY="sk-lf-your-key"
LANGFUSE_PUBLIC_KEY="pk-lf-your-key"
You will also need to set an environment variable indicating which Langfuse instance you are using:
# If you are using a local instance
LANGFUSE_HOST="http://127.0.0.1:3000"
# Hosted EU instance
LANGFUSE_HOST="https://cloud.langfuse.com"
# Hosted US instance
LANGFUSE_HOST="https://us.cloud.langfuse.com"Install the Python SDK for Langfuse and OpenAI:
pip install langfuse==2.60.8 openai --upgrade
Here, we are installing version 2 of the Langfuse SDK. If you are using version 3 or newer, some of the snippets below may not work; see the migration guide for what has changed. Regardless of how you set up Langfuse or which SDK version/integration you use, the core concepts remain the same.
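Optionally, you can verify that the keys and host are picked up correctly before going further. A minimal sketch using the v2 client's auth_check method:
from langfuse import Langfuse

# The client reads LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY, and LANGFUSE_HOST
langfuse = Langfuse()

# Returns True if the credentials and host are valid
print(langfuse.auth_check())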
Now we can run a simple example to make sure everything works:
# Note that we don't import the OpenAI module directly,
# but instead use the wrapper provided by Langfuse
from langfuse.openai import openai
import os
client = openai.OpenAI()
response = client.chat.completions.create(
    model=os.environ.get("MODEL_NAME") or "gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital city of Canada?"},
        {"role": "assistant", "content": "The capital city of Canada is Ottawa."},
        {"role": "user", "content": "Can you share three interesting facts about Ottawa?"}
    ]
)
print(response.choices[0].message.content)
After that, you will see the log of the executed generation in the web interface under the Tracing/Traces tab. Here, we used the OpenAI SDK integration. However, Langfuse allows you to trace apps built with other frameworks like LangChain, LlamaIndex, and more. How you monitor those apps is slightly different, but in all cases you should see traces in the UI like below:
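For example, the LangChain integration attaches a Langfuse callback handler instead of wrapping the client. A minimal sketch, assuming the langchain-openai package is installed alongside the v2 SDK:
from langfuse.callback import CallbackHandler  # Langfuse v2 LangChain integration
from langchain_openai import ChatOpenAI
import os

# The handler reads the LANGFUSE_* environment variables set earlier
langfuse_handler = CallbackHandler()

llm = ChatOpenAI(model=os.environ.get("MODEL_NAME") or "gpt-4o")

# Passing the handler as a callback records this call as a trace
response = llm.invoke(
    "What is the capital city of Canada?",
    config={"callbacks": [langfuse_handler]},
)
print(response.content)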
Traces and Observations
The key functionality of Langfuse is the recording and management of logs for operations performed with LLMs. The Langfuse SDKs allow tracking the complete execution flow, including API calls, context, prompts, and more. The Langfuse web interface, in turn, enables in-depth examination of the recorded logs through a “nested” display. The data model used in Langfuse to achieve observability is based on traces and observations.
A trace (highlighted in blue in the screenshot below) typically represents a single request or operation. It contains the overall input and output of the function, as well as metadata about the request, such as the user, session, and tags. Typically, a trace corresponds to a single API call of an application. Each trace can contain multiple observations (highlighted in red) to log individual execution steps like LLM invocations. Observations can be nested.
There are three kinds of observations:
Events — mark individual point-in-time occurrences within a trace.
Spans — capture the duration of a specific unit of work, measuring how long each operation takes.
Generations — specialized spans that record AI model outputs, including the prompt text, token usage, and cost details.
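These observation types can also be created manually with the low-level v2 client, which is useful when you want full control over the trace structure. A minimal sketch (names and values are illustrative):
from langfuse import Langfuse

langfuse = Langfuse()

# A trace for one logical request
trace = langfuse.trace(name="travel-planner", user_id="user-id")

# An event: a point-in-time occurrence
trace.event(name="request-received", input={"city": "Barcelona"})

# A span: a timed unit of work
span = trace.span(name="retrieve-context", input={"query": "Barcelona highlights"})
span.end(output={"documents": 3})

# A generation: a span specialized for model calls
generation = trace.generation(
    name="llm-call",
    model="gpt-4o",
    input=[{"role": "user", "content": "Plan a day in Barcelona."}],
)
generation.end(output="Start at the Sagrada Família...")

# Send buffered events before the script exits
langfuse.flush()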
A more common way to annotate observations in your code is the observe decorator provided by the Langfuse SDK. Different integrations provide different ways to capture traces.
from langfuse.decorators import observe
from langfuse.openai import openai
import os
client = openai.OpenAI()
@observe()
def plan():
    return client.chat.completions.create(
        model=os.environ.get("MODEL_NAME") or "gpt-4o",
        messages=[
            {"role": "system", "content": "You are an expert travel planner."},
            {"role": "user", "content": "Plan a three-day itinerary for a first-time visitor to Barcelona."}
        ],
    ).choices[0].message.content

@observe()
def main():
    return plan()

print(main())
Each decorated function encountered in the call chain is recorded as a separate observation, together with all of its inputs and outputs.
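If you call a model that the OpenAI wrapper does not cover, the v2 decorator also lets you mark a function as a generation and report the model details yourself. A minimal sketch with an illustrative stand-in for the model call:
from langfuse.decorators import langfuse_context, observe

@observe(as_type="generation")
def call_custom_model(prompt: str) -> str:
    # Illustrative stand-in for a model client that Langfuse does not wrap
    completion = f"Echo: {prompt}"
    # Attach model details to the current generation observation
    langfuse_context.update_current_observation(
        model="my-custom-model",
        input=prompt,
        output=completion,
    )
    return completion

print(call_custom_model("Hello, Langfuse!"))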
You can also augment this information explicitly with arbitrary key-value metadata, which makes traces easier to interpret, for example during testing. Note that the exact API for this differs between SDK versions and integrations.
from langfuse.decorators import langfuse_context, observe
@observe()
def nested():
    # Update trace metadata from anywhere inside the call stack
    langfuse_context.update_current_trace(
        metadata={"key": "value"}
    )
    # Update metadata for the current observation
    langfuse_context.update_current_observation(
        metadata={"key": "value"}
    )
    return

@observe()
def main():
    return nested()

print(main())
In addition to metadata, tags can be attached to traces for convenient filtering in the web UI.
from langfuse.decorators import langfuse_context, observe
@observe()
def fn():
    langfuse_context.update_current_trace(
        tags=["tag-1", "tag-2"]
    )

fn()
In real-world cases, a single trace may contain many observations. You can highlight the important ones using the level attribute.
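For example, the v2 decorator context lets you raise the level of the current observation (DEBUG, DEFAULT, WARNING, and ERROR are supported) and attach a status message. A minimal sketch:
from langfuse.decorators import langfuse_context, observe

@observe()
def flaky_step():
    # Mark this observation so it stands out when scanning a large trace
    langfuse_context.update_current_observation(
        level="WARNING",
        status_message="Fell back to a cached answer"
    )
    return "cached answer"

flaky_step()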
To separate traces collected in different contexts, such as production, test, or dev, you can explicitly specify the environment parameter via an environment variable or through langfuse_context.
LANGFUSE_TRACING_ENVIRONMENT="production"
langfuse_context.configure(environment="dev")
Then you can filter traces by the environment label in the web UI.
Another way to label traces according to release versions is by attaching a release tag, typically a commit SHA or similar identifier:
LANGFUSE_RELEASE="<release_tag>"
Usually, interactions with LLMs occur as a series of calls grouped by shared context. To group traces from such calls together, simply assign them a common session_id.
from langfuse.decorators import langfuse_context, observe
@observe()
def fn():
    langfuse_context.update_current_trace(
        session_id="your-session-id"
    )

fn()
Traces with a common session_id appear under the Tracing/Sessions tab.
Similarly, a user_id label allows grouping traces by the users associated with them:
from langfuse.decorators import langfuse_context, observe
@observe()
def fn():
    langfuse_context.update_current_trace(
        user_id="user-id"
    )

fn()
This is useful for tracking per-user usage, budgeting, and more.
Prompt Management
The second most important feature of Langfuse is prompt management. Essentially, it's Git for your prompts. Instead of hard-coding prompts in your application code, you can store and manage them through the Langfuse web UI and retrieve them via API. This offers several significant advantages:
You can modify prompts without deploying a new version of your application.
You can roll back prompts to previous versions or run multiple versions simultaneously for A/B testing.
Non-technical team members can work with prompts through the web UI without touching the codebase.
Prompts are created and managed under the Prompts tab:
Or via the Langfuse SDK client:
from langfuse import Langfuse

langfuse_client = Langfuse()

langfuse_client.create_prompt(
    name="Analyst",
    type="text",
    prompt="As a seasoned cybersecurity analyst specializing in {{threat_type}}, what mitigation steps would you take to secure a {{network_type}} environment?",
    labels=["production"],
)
In prompts, you can use {{variables}} that will be replaced with values when the prompt is used. At runtime, you can fetch a prompt, populate it with values, and use it for an LLM call:
from langfuse.decorators import langfuse_context, observe
from langfuse.openai import openai
from langfuse import Langfuse
import os
client = openai.OpenAI()
# Initialize the Langfuse client
langfuse = Langfuse()
@observe()
def analyst(threat_type, network_type):
    prompt = langfuse.get_prompt("Analyst")
    # Compile the prompt template with variables
    compiled_prompt = prompt.compile(threat_type=threat_type, network_type=network_type)
    answer = client.chat.completions.create(
        model=os.environ.get("MODEL_NAME") or "gpt-4o",
        messages=[
            {"role": "system", "content": "You are a cybersecurity analyst."},
            {"role": "user", "content": compiled_prompt}
        ],
    ).choices[0].message.content
    return answer

print(analyst("malware", "IoT network"))
You can assign labels to prompt versions. By default, get_prompt returns the version labeled production (usually the latest). You can also specify a label explicitly, which is useful for A/B tests.
from langfuse.decorators import langfuse_context, observe
from langfuse.openai import openai
from langfuse import Langfuse
import os
import random
client = openai.OpenAI()
langfuse = Langfuse()
@observe()
def analyst(threat_type, network_type):
    option_a = langfuse.get_prompt("Analyst", label="opt-a")
    option_b = langfuse.get_prompt("Analyst", label="opt-b")
    prompt = random.choice([option_a, option_b])
    compiled_prompt = prompt.compile(threat_type=threat_type, network_type=network_type)
    answer = client.chat.completions.create(
        model=os.environ.get("MODEL_NAME") or "gpt-4o",
        messages=[
            {"role": "system", "content": "You are a cybersecurity analyst."},
            {"role": "user", "content": compiled_prompt}
        ],
    ).choices[0].message.content
    return answer

for i in range(4):
    print(analyst("malware", "IoT network"))
Security
An important aspect of working with LLMs is ensuring protection against various threats, from prompt injections to leaks of personally identifiable information (PII). Langfuse can be used to monitor and safeguard against these risks.
The primary method for mitigating these threats is to pre-filter requests to the LLM and post-filter the responses. Langfuse allows you to track data changes during such filtering, collect related metrics, and evaluate the effectiveness of each method.
A minimal template looks like this:
from langfuse.openai import openai # OpenAI integration
from langfuse.decorators import observe, langfuse_context
import os
@observe()
def anonymize(inp):
    inp = inp.replace("Alice Smith", "[REDACTED_PERSON1]")
    inp = inp.replace("Bob Johnson", "[REDACTED_PERSON2]")
    inp = inp.replace("Carol Lee", "[REDACTED_PERSON3]")
    return inp

@observe()
def deanonymize(answer):
    answer = answer.replace("[REDACTED_PERSON1]", "Alice Smith")
    answer = answer.replace("[REDACTED_PERSON2]", "Bob Johnson")
    answer = answer.replace("[REDACTED_PERSON3]", "Carol Lee")
    return answer

@observe()
def summarize_transcript(prompt: str):
    sanitized_prompt = anonymize(prompt)
    answer = openai.chat.completions.create(
        model=os.environ.get("MODEL_NAME") or "gpt-4o",
        max_tokens=100,
        messages=[
            {"role": "system", "content": "Summarize the following project meeting transcript."},
            {"role": "user", "content": sanitized_prompt}
        ],
    ).choices[0].message.content
    sanitized_model_output = deanonymize(answer)
    return sanitized_model_output

prompt = """
Attendees: Alice Smith, Bob Johnson, and Carol Lee.
Alice Smith: "We need to finalize the Q3 roadmap by next Monday."
Bob Johnson: "I've prepared the draft timeline and shared it in the drive."
Carol Lee: "Let's review the milestones and assign owners for each deliverable."
Alice Smith: "Agreed. I'll update the document after this call and circulate it for feedback."
"""

print(summarize_transcript(prompt))
In practice, for pre- and post-processing, you might use libraries such as LLM Guard, PromptArmor, NeMo Guardrails, and Lakera.
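For illustration, here is a rough sketch of plugging a third-party scanner into this pattern, using LLM Guard's PromptInjection input scanner; the scanner calls and their return values are an assumption to verify against the LLM Guard documentation for your version:
from llm_guard.input_scanners import PromptInjection  # pip install llm-guard
from langfuse.decorators import langfuse_context, observe

scanner = PromptInjection()

@observe()
def check_prompt(prompt: str) -> str:
    # Assumed to return (sanitized_prompt, is_valid, risk_score)
    sanitized_prompt, is_valid, risk_score = scanner.scan(prompt)
    # Record the verdict on the observation so blocked requests are easy to filter
    langfuse_context.update_current_observation(
        metadata={"is_valid": is_valid, "risk_score": risk_score}
    )
    if not is_valid:
        raise ValueError("Possible prompt injection detected")
    return sanitized_prompt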
Conclusion
Langfuse provides a unified, open-source platform for end-to-end observability, prompt management, and security in your LLM-driven applications. To get the most out of Langfuse, you should:
Install and configure the Langfuse server and set your API keys and LANGFUSE_HOST environment variables;
Instrument your code with the Langfuse SDK: wrap LLM calls with the @observe() decorator (or the various integrations), and configure metadata (tags, levels, session IDs) to trace every request and sub-step;
Explore the web UI under Tracing → Traces and Observations to drill into nested executions, filter by environment, release, or user, and highlight key observations;
Use prompt management to create, version, label, and fetch prompts via API to iterate or A/B test without redeploying your application;
Enforce security best practices by pre- and post-processing inputs/outputs (e.g. anonymization), tracking changes in Langfuse, and monitoring for prompt injections or PII leaks.