Facilitates LLM experiment execution and prompt evaluation using Langfuse datasets and automated LLM-as-judge scoring.
Langfuse Experiment Runner allows developers to rigorously test and evaluate LLM prompt or model changes by running batch experiments against Langfuse datasets. It supports both local Python evaluators and cloud-based Langfuse judges, enabling users to compare performance across different versions, analyze failure points, and visualize metrics. This skill streamlines the iterative cycle of prompt engineering by providing structured commands for listing, comparing, and auditing LLM outputs directly within the development workflow.
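To make the workflow concrete, below is a minimal sketch of the kind of experiment loop this skill automates, assuming the Langfuse Python SDK's v2-style datasets API. The dataset name, run name, and answer_question function are placeholders for the prompt or model under test, not part of the skill itself.

```python
from langfuse import Langfuse

# Illustrative experiment loop (Langfuse Python SDK, v2-style API).
# Dataset name, run name, and answer_question() are placeholders.
langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment


def answer_question(question) -> str:
    # Placeholder for the LLM call under test (the prompt/model being evaluated).
    return f"stub answer for: {question}"


dataset = langfuse.get_dataset("qa-golden-set")

for item in dataset.items:
    # Link each trace to its dataset item under a named experiment run.
    with item.observe(run_name="prompt-v2-baseline") as trace_id:
        output = answer_question(item.input)  # item.input shape depends on the dataset
        # Simple local evaluator: exact match against the expected output.
        langfuse.score(
            trace_id=trace_id,
            name="exact_match",
            value=1.0 if output == item.expected_output else 0.0,
        )

langfuse.flush()  # make sure all events are sent before the script exits
```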
Key Features
- Support for cloud-based Langfuse LLM-as-judge prompts
- Side-by-side run comparison and failure analysis tools
- Integration of custom local Python evaluator scripts (see the sketch after this list)
- Automated experiment execution against Langfuse datasets
- Configurable concurrency for high-volume evaluation tasks
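The exact evaluator contract this skill expects is not specified here, but a local Python evaluator is typically a small, deterministic function that turns an output and its reference into a named numeric score. The following is a hypothetical example; the signature and score shape are assumptions for illustration only.

```python
# Hypothetical local evaluator: illustrates the general shape of a
# dependency-free check that returns a named score, which could then be
# recorded against a trace (e.g. via langfuse.score in the loop above).

def keyword_coverage(output: str, expected_output: str) -> dict:
    """Fraction of expected keywords that appear in the model output."""
    expected_terms = {t.lower() for t in expected_output.split()}
    if not expected_terms:
        return {"name": "keyword_coverage", "value": 0.0}
    found = sum(1 for term in expected_terms if term in output.lower())
    return {"name": "keyword_coverage", "value": found / len(expected_terms)}


# Example usage: prints {'name': 'keyword_coverage', 'value': 0.5}
print(keyword_coverage("Paris is in France", "Paris France Europe capital"))
```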
Use Cases
- Benchmarking new model versions against established gold-standard datasets
- Identifying regressions and edge-case failures through automated evaluation workflows (see the sketch after this list)
- A/B testing different LLM prompts to determine which yields higher accuracy scores
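As a rough illustration of the failure-analysis step, a script might collect the per-item scores produced during an experiment run and surface the lowest-scoring items for manual inspection. The tuple shape and threshold below are assumptions for illustration, not part of the skill's interface.

```python
# Sketch of simple failure analysis: given (item_id, score, output) tuples
# accumulated during an experiment run (e.g. the loop sketched earlier),
# list the lowest-scoring items first so regressions and edge cases can be
# inspected. The tuple shape and 0.5 threshold are illustrative.
from typing import Iterable, List, Tuple


def report_failures(results: Iterable[Tuple[str, float, str]],
                    threshold: float = 0.5) -> List[Tuple[str, float, str]]:
    failures = [r for r in results if r[1] < threshold]
    for item_id, score, output in sorted(failures, key=lambda r: r[1]):
        print(f"{item_id}: score={score:.2f} output={output!r}")
    return failures
```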