About
The Langfuse Agent Evaluator is a specialized skill that brings rigorous observability and testing to AI agent development. It automates a multi-phase workflow: running dataset experiments with configured judges, performing deep-dive root cause analysis on failed traces, and comparing performance across development cycles. By identifying specific failure patterns and symptoms, it produces structured fix recommendations without auto-applying unverified changes, giving developers high-quality documentation and a clear path to improvement via Linear issues or local reports.
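The experiment phase described above can be sketched as a loop that runs each dataset item through the agent, scores the output with a judge, and collects failures for root cause analysis. This is a minimal illustrative sketch: the `agent`, `judge`, and `run_experiment` names are assumptions for illustration, not the skill's or Langfuse's actual API.

```python
from dataclasses import dataclass

@dataclass
class Item:
    input: str
    expected: str

def agent(query: str) -> str:
    # Stand-in for the agent under test (hypothetical).
    return query.upper()

def judge(output: str, expected: str) -> float:
    # Hypothetical exact-match judge; real judges are often LLM-based
    # and return graded scores rather than 0/1.
    return 1.0 if output == expected else 0.0

def run_experiment(dataset: list[Item]):
    # Run every item, score it, and separate out failures
    # so they can be queued for deep-dive analysis.
    results = []
    for item in dataset:
        out = agent(item.input)
        results.append({"input": item.input, "output": out,
                        "score": judge(out, item.expected)})
    failures = [r for r in results if r["score"] < 1.0]
    return results, failures

dataset = [Item("hello", "HELLO"), Item("world", "world")]
results, failures = run_experiment(dataset)
print(len(failures))  # prints 1: one item flagged for root cause analysis
```

Comparing two development cycles then reduces to running the same dataset against each agent version and diffing the per-item scores.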