Facilitates LLM experiment execution and prompt evaluation using Langfuse datasets and automated LLM-as-judge scoring.
Langfuse Experiment Runner allows developers to rigorously test and evaluate LLM prompt or model changes by running batch experiments against Langfuse datasets. It supports both local Python evaluators and cloud-based Langfuse judges, enabling users to compare performance across different versions, analyze failure points, and visualize metrics. This skill streamlines the iterative cycle of prompt engineering by providing structured commands for listing, comparing, and auditing LLM outputs directly within the development workflow.
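To make the workflow concrete, below is a minimal sketch of the kind of experiment loop this skill automates, assuming the Langfuse Python SDK's v2-style datasets API. The dataset name, run name, and answer_question function are placeholders for the prompt or model under test, not part of the skill itself.

```python
from langfuse import Langfuse

# Illustrative experiment loop (Langfuse Python SDK, v2-style API).
# Dataset name, run name, and answer_question() are placeholders.
langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment


def answer_question(question) -> str:
    # Placeholder for the LLM call under test (the prompt/model being evaluated).
    return f"stub answer for: {question}"


dataset = langfuse.get_dataset("qa-golden-set")

for item in dataset.items:
    # Link each trace to its dataset item under a named experiment run.
    with item.observe(run_name="prompt-v2-baseline") as trace_id:
        output = answer_question(item.input)  # item.input shape depends on the dataset
        # Simple local evaluator: exact match against the expected output.
        langfuse.score(
            trace_id=trace_id,
            name="exact_match",
            value=1.0 if output == item.expected_output else 0.0,
        )

langfuse.flush()  # make sure all events are sent before the script exits
```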
Key Features
- Support for cloud-based Langfuse LLM-as-judge prompts
- Side-by-side run comparison and failure analysis tools
- Integration of custom local Python evaluator scripts (see the sketch after this list)
- Automated experiment execution against Langfuse datasets
- Configurable concurrency for high-volume evaluation tasks
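The exact evaluator contract this skill expects is not specified here, but a local Python evaluator is typically a small, deterministic function that turns an output and its reference into a named numeric score. The following is a hypothetical example; the signature and score shape are assumptions for illustration only.

```python
# Hypothetical local evaluator: illustrates the general shape of a
# dependency-free check that returns a named score, which could then be
# recorded against a trace (e.g. via langfuse.score in the loop above).

def keyword_coverage(output: str, expected_output: str) -> dict:
    """Fraction of expected keywords that appear in the model output."""
    expected_terms = {t.lower() for t in expected_output.split()}
    if not expected_terms:
        return {"name": "keyword_coverage", "value": 0.0}
    found = sum(1 for term in expected_terms if term in output.lower())
    return {"name": "keyword_coverage", "value": found / len(expected_terms)}


# Example usage: prints {'name': 'keyword_coverage', 'value': 0.5}
print(keyword_coverage("Paris is in France", "Paris France Europe capital"))
```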
Use Cases
- Benchmarking new model versions against established gold-standard datasets
- Identifying regressions and edge-case failures through automated evaluation workflows (see the sketch after this list)
- A/B testing different LLM prompts to determine which yields higher accuracy scores
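As a rough illustration of the failure-analysis step, a script might collect the per-item scores produced during an experiment run and surface the lowest-scoring items for manual inspection. The tuple shape and threshold below are assumptions for illustration, not part of the skill's interface.

```python
# Sketch of simple failure analysis: given (item_id, score, output) tuples
# accumulated during an experiment run (e.g. the loop sketched earlier),
# list the lowest-scoring items first so regressions and edge cases can be
# inspected. The tuple shape and 0.5 threshold are illustrative.
from typing import Iterable, List, Tuple


def report_failures(results: Iterable[Tuple[str, float, str]],
                    threshold: float = 0.5) -> List[Tuple[str, float, str]]:
    failures = [r for r in results if r[1] < threshold]
    for item_id, score, output in sorted(failures, key=lambda r: r[1]):
        print(f"{item_id}: score={score:.2f} output={output!r}")
    return failures
```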