- Automated experiment execution against Langfuse datasets and live production traces (see the dataset-run sketch below)
- Advanced run comparison and failure analysis with human annotation integration (comparison sketch below)
- Canonical score normalization to [0, 1] for consistent evaluation across metrics with different native scales (normalization sketch below)
- Concurrent execution support for high-throughput testing and evaluation workflows (concurrency sketch below)
- Support for versioned LLM-as-judge prompts stored and managed in Langfuse (prompt-versioning sketch below)
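
A minimal sketch of a dataset experiment run, assuming the Langfuse Python SDK v2 interface (`get_dataset`, `item.observe`, `score`); the dataset name `qa-eval`, the run name, and `my_app` are hypothetical placeholders for the system under test:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

def my_app(question: str) -> str:
    # placeholder for the application being evaluated
    return "42"

dataset = langfuse.get_dataset("qa-eval")  # hypothetical dataset name

for item in dataset.items:
    # observe() links the resulting trace to this dataset item and run
    with item.observe(run_name="baseline-v1") as trace_id:
        output = my_app(item.input)
        # attach a score to the linked trace
        langfuse.score(
            trace_id=trace_id,
            name="exact_match",
            value=float(output == item.expected_output),
        )

langfuse.flush()  # ensure buffered events are sent before exit
```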
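Run comparison can be reduced to diffing per-item scores between two runs and surfacing regressions for human annotation. A self-contained sketch with hypothetical score data (in practice these maps would be fetched from two Langfuse dataset runs):

```python
from typing import Dict, Tuple

# Hypothetical per-item scores keyed by dataset item id,
# already normalized to [0, 1].
baseline: Dict[str, float] = {"item-1": 1.0, "item-2": 0.8, "item-3": 0.5}
candidate: Dict[str, float] = {"item-1": 1.0, "item-2": 0.4, "item-3": 0.7}

def regressions(
    base: Dict[str, float],
    cand: Dict[str, float],
    threshold: float = 0.1,
) -> Dict[str, Tuple[float, float]]:
    """Items whose score dropped by more than `threshold` between runs."""
    return {
        item: (base[item], cand[item])
        for item in base.keys() & cand.keys()
        if base[item] - cand[item] > threshold
    }

for item, (before, after) in regressions(baseline, candidate).items():
    print(f"{item}: {before:.2f} -> {after:.2f}  (flag for human annotation)")
```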
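Canonical normalization maps each metric's native range onto [0, 1] via min-max scaling, so that, for example, a 1-5 judge rating and a 0-100 rubric score become directly comparable. A minimal sketch:

```python
def normalize(value: float, lo: float, hi: float) -> float:
    """Min-max normalization: map a raw score from [lo, hi] onto [0, 1]."""
    if hi == lo:
        raise ValueError("degenerate scale: lo and hi must differ")
    # clamp so out-of-range raw scores stay within the canonical range
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))

# e.g. a 1-5 Likert judge rating of 4 becomes 0.75
assert normalize(4, 1, 5) == 0.75
```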
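High-throughput execution can be sketched with asyncio and a semaphore bounding the number of in-flight calls; `evaluate_item` and the concurrency limit are illustrative placeholders:

```python
import asyncio

async def evaluate_item(item: str) -> str:
    # placeholder for one model call plus scoring step
    await asyncio.sleep(0.1)
    return f"scored {item}"

async def run_all(items, max_concurrency: int = 8):
    # bound concurrent requests so provider rate limits aren't exceeded
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(item):
        async with sem:
            return await evaluate_item(item)

    return await asyncio.gather(*(bounded(i) for i in items))

results = asyncio.run(run_all([f"item-{n}" for n in range(20)]))
print(len(results), "items evaluated")
```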
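Pinning a judge prompt to a specific version keeps evaluation results reproducible across runs. A sketch assuming the Langfuse prompt-management API (`get_prompt` with a `version` argument, `compile` for placeholder substitution); the prompt name, version, and variables are hypothetical:

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Pin a specific version of a judge prompt managed in Langfuse
judge_prompt = langfuse.get_prompt("faithfulness-judge", version=3)

# compile() fills the {{...}} placeholders defined in the stored prompt
rendered = judge_prompt.compile(
    question="What is the capital of France?",
    answer="Paris",
)
print(rendered)
```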