What happens if a dimension is marked as critical?

If a critical dimension regresses below the baseline during an iteration, the optimization loop triggers an immediate rollback regardless of other performance gains.
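The rollback rule can be pictured as a comparison of each critical dimension against its baseline score. The sketch below is illustrative only: `should_rollback`, the dimension names, and the score values are hypothetical and are not taken from the skill's actual implementation.

```python
from typing import Dict, Set


def should_rollback(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    critical: Set[str],
) -> bool:
    """Return True if any critical dimension scores below its baseline.

    Gains on non-critical dimensions are deliberately ignored here,
    mirroring the "regardless of other performance gains" rule.
    """
    return any(candidate[dim] < baseline[dim] for dim in critical)


# Hypothetical scores on a 0-1 scale.
baseline = {"accuracy": 0.82, "helpfulness": 0.70, "latency": 0.60}
candidate = {"accuracy": 0.80, "helpfulness": 0.85, "latency": 0.75}

# accuracy dropped from 0.82 to 0.80, so the iteration is rolled back
# even though the other two dimensions improved.
rollback = should_rollback(baseline, candidate, critical={"accuracy"})
print(rollback)  # True
```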

Do I need an existing Langfuse account to use this?

Yes, this skill is designed to integrate with Langfuse for managing datasets, prompt registries, and trace scores.

Can I evaluate agents without a manual dataset?

Yes, the skill supports a 'Live-Trace' mode which evaluates performance against a sample of recent production traces instead of a fixed dataset.
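One way to think about Live-Trace mode is as sampling from a window of the most recent traces. The following is a minimal sketch under stated assumptions: the trace dictionaries, the `timestamp` field, the window size, and `sample_recent_traces` are all hypothetical, and the skill's real sampling strategy may differ.

```python
import random
from typing import Dict, List


def sample_recent_traces(
    traces: List[Dict], k: int, window: int = 50, seed: int = 0
) -> List[Dict]:
    """Sample up to k traces from the `window` most recent ones.

    A fixed seed keeps the sample reproducible between evaluation runs.
    """
    recent = sorted(traces, key=lambda t: t["timestamp"], reverse=True)[:window]
    rng = random.Random(seed)
    return rng.sample(recent, min(k, len(recent)))


# Hypothetical traces; a larger timestamp means a more recent trace.
traces = [{"id": f"trace-{i}", "timestamp": i} for i in range(200)]
sample = sample_recent_traces(traces, k=10)
print(len(sample))  # 10
```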

How are evaluation scores normalized?

The infrastructure uses a standard 0-1 scale. While judge prompts can use other scales (like 0-10), the runtime normalizes these before comparing them against your defined thresholds.
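As an illustration of the normalization step, a judge score on an arbitrary scale can be mapped onto 0-1 with a min-max rescale before it is compared with a threshold. This is a sketch, not the runtime's code: `normalize`, the clamping behaviour, and the 0.65 threshold are assumptions.

```python
def normalize(score: float, scale_min: float, scale_max: float) -> float:
    """Rescale a judge score from [scale_min, scale_max] onto [0, 1].

    Out-of-range scores are clamped to the scale bounds first
    (an assumption; the real runtime may reject them instead).
    """
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    clamped = min(max(score, scale_min), scale_max)
    return (clamped - scale_min) / (scale_max - scale_min)


# A judge prompt that scores on 0-10 returns 7.
normalized = normalize(7, 0, 10)

# Compare against a hypothetical threshold defined on the 0-1 scale.
threshold = 0.65
passes = normalized >= threshold
print(normalized, passes)
```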

What is an eval contract in this context?

An eval contract is a structured document (JSON/YAML) that defines the quality dimensions, thresholds, and baseline metrics used as a preflight gate for agent optimization loops.
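To make the idea concrete, here is a sketch of what such a contract might look like as JSON, together with a small preflight check. Every field name (`agent`, `dimensions`, `threshold`, `weight`, `critical`, `baseline`) and the `failing_dimensions` helper are assumptions made for illustration; the skill's actual schema is not shown in this listing.

```python
import json
from typing import Dict, List

# Hypothetical contract; the real schema may use different field names.
CONTRACT_JSON = """
{
  "agent": "support-bot",
  "dimensions": [
    {"name": "accuracy", "threshold": 0.8, "weight": 0.6, "critical": true},
    {"name": "tone", "threshold": 0.6, "weight": 0.4, "critical": false}
  ],
  "baseline": {"accuracy": 0.82, "tone": 0.71}
}
"""


def failing_dimensions(contract: Dict, scores: Dict[str, float]) -> List[str]:
    """List the dimensions whose score falls below the contract threshold.

    A dimension with no score is treated as failing (scored as 0.0).
    """
    return [
        dim["name"]
        for dim in contract["dimensions"]
        if scores.get(dim["name"], 0.0) < dim["threshold"]
    ]


contract = json.loads(CONTRACT_JSON)
failures = failing_dimensions(contract, {"accuracy": 0.85, "tone": 0.55})
print(failures)  # ['tone']
```

An empty list would mean the preflight gate passes and the optimization loop may proceed.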

Langfuse Eval Infrastructure

Name: Langfuse Eval Infrastructure
Author: mberto10


Security and Testing

Establishes a standardized evaluation framework for AI agents to measure performance and prevent quality regressions.

Langfuse Eval Infrastructure provides an "eval contract" system that connects agent development to optimization loops. Developers define multi-dimensional quality facets, set performance thresholds, and manage judge prompts within Langfuse or via local traces. Baseline runs and critical regression checks gate each agent iteration on measured quality before deployment, so agent performance does not degrade unnoticed during optimization.

Key Features

01 Critical dimension monitoring with automated rollback triggers

02 Multi-dimensional metric definition with custom weights and thresholds

03 Support for both curated dataset and live-trace evaluation modes

04 Baseline performance tracking for iteration comparison

05 Automated bootstrapping of Langfuse datasets and judge prompts

Use Cases

01 Setting up pre-deployment quality gates for production AI agents

02 Standardizing evaluation criteria across a fleet of autonomous agents

03 Benchmarking new model versions against historical performance data


Install with 🐟 Skill.Fish

npx skillfish add mberto10/mberto-compound eval-infrastructure
