- Automated experiment execution against Langfuse datasets and live production traces (see the dataset-run sketch below)
- Advanced run comparison and failure analysis with human annotation integration (comparison sketch below)
- Canonical score normalization to [0, 1] for consistent evaluation across metrics with different native scales (normalization sketch below)
- Concurrent execution support for high-throughput testing and evaluation workflows (concurrency sketch below)
- Support for versioned LLM-as-judge prompts stored and managed in Langfuse (prompt-versioning sketch below)
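
A minimal sketch of a dataset experiment run, assuming the Langfuse Python SDK v2 interface (`get_dataset`, `item.observe`, `score`); the dataset name `qa-eval`, the run name, and `my_app` are hypothetical placeholders for the system under test:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

def my_app(question: str) -> str:
    # placeholder for the application being evaluated
    return "42"

dataset = langfuse.get_dataset("qa-eval")  # hypothetical dataset name

for item in dataset.items:
    # observe() links the resulting trace to this dataset item and run
    with item.observe(run_name="baseline-v1") as trace_id:
        output = my_app(item.input)
        # attach a score to the linked trace
        langfuse.score(
            trace_id=trace_id,
            name="exact_match",
            value=float(output == item.expected_output),
        )

langfuse.flush()  # ensure buffered events are sent before exit
```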
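Run comparison can be reduced to diffing per-item scores between two runs and surfacing regressions for human annotation. A self-contained sketch with hypothetical score data (in practice these maps would be fetched from two Langfuse dataset runs):

```python
from typing import Dict, Tuple

# Hypothetical per-item scores keyed by dataset item id,
# already normalized to [0, 1].
baseline: Dict[str, float] = {"item-1": 1.0, "item-2": 0.8, "item-3": 0.5}
candidate: Dict[str, float] = {"item-1": 1.0, "item-2": 0.4, "item-3": 0.7}

def regressions(
    base: Dict[str, float],
    cand: Dict[str, float],
    threshold: float = 0.1,
) -> Dict[str, Tuple[float, float]]:
    """Items whose score dropped by more than `threshold` between runs."""
    return {
        item: (base[item], cand[item])
        for item in base.keys() & cand.keys()
        if base[item] - cand[item] > threshold
    }

for item, (before, after) in regressions(baseline, candidate).items():
    print(f"{item}: {before:.2f} -> {after:.2f}  (flag for human annotation)")
```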
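Canonical normalization maps each metric's native range onto [0, 1] via min-max scaling, so that, for example, a 1-5 judge rating and a 0-100 rubric score become directly comparable. A minimal sketch:

```python
def normalize(value: float, lo: float, hi: float) -> float:
    """Min-max normalization: map a raw score from [lo, hi] onto [0, 1]."""
    if hi == lo:
        raise ValueError("degenerate scale: lo and hi must differ")
    # clamp so out-of-range raw scores stay within the canonical range
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))

# e.g. a 1-5 Likert judge rating of 4 becomes 0.75
assert normalize(4, 1, 5) == 0.75
```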
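High-throughput execution can be sketched with asyncio and a semaphore bounding the number of in-flight calls; `evaluate_item` and the concurrency limit are illustrative placeholders:

```python
import asyncio

async def evaluate_item(item: str) -> str:
    # placeholder for one model call plus scoring step
    await asyncio.sleep(0.1)
    return f"scored {item}"

async def run_all(items, max_concurrency: int = 8):
    # bound concurrent requests so provider rate limits aren't exceeded
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(item):
        async with sem:
            return await evaluate_item(item)

    return await asyncio.gather(*(bounded(i) for i in items))

results = asyncio.run(run_all([f"item-{n}" for n in range(20)]))
print(len(results), "items evaluated")
```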
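Pinning a judge prompt to a specific version keeps evaluation results reproducible across runs. A sketch assuming the Langfuse prompt-management API (`get_prompt` with a `version` argument, `compile` for placeholder substitution); the prompt name, version, and variables are hypothetical:

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Pin a specific version of a judge prompt managed in Langfuse
judge_prompt = langfuse.get_prompt("faithfulness-judge", version=3)

# compile() fills the {{...}} placeholders defined in the stored prompt
rendered = judge_prompt.compile(
    question="What is the capital of France?",
    answer="Paris",
)
print(rendered)
```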