Does this work with Langfuse?

Yes. The skill includes specific commands and workflows for using Langfuse to retrieve traces, manage datasets, and run evaluation experiments.

What is the 'Golden Dataset' concept?

A Golden Dataset is a curated set of test cases with known expected outcomes used to measure baseline performance, detect regressions, and compare model iterations.
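To make the idea concrete, here is a minimal sketch of a Golden Dataset used as a regression check. It is not taken from the skill itself: the names `GoldenCase`, `pass_rate`, and `regressions`, the exact-match scoring, and the two toy agents are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class GoldenCase:
    """One curated test case with a known expected outcome (hypothetical shape)."""
    case_id: str
    prompt: str
    expected: str


Agent = Callable[[str], str]


def passes(case: GoldenCase, agent: Agent) -> bool:
    # Exact match after trimming whitespace; real evaluations often need a looser comparison.
    return agent(case.prompt).strip() == case.expected


def pass_rate(cases: List[GoldenCase], agent: Agent) -> float:
    """Fraction of golden cases the agent answers as expected."""
    return sum(passes(c, agent) for c in cases) / len(cases)


def regressions(cases: List[GoldenCase], baseline: Agent, candidate: Agent) -> List[str]:
    """IDs of cases the baseline passes but the candidate fails."""
    return [c.case_id for c in cases if passes(c, baseline) and not passes(c, candidate)]


# Toy data and stand-in agents, purely illustrative.
GOLDEN_CASES = [
    GoldenCase("capital-fr", "Capital of France?", "Paris"),
    GoldenCase("sum-2-2", "What is 2 + 2?", "4"),
]


def baseline_agent(prompt: str) -> str:
    return {"Capital of France?": "Paris", "What is 2 + 2?": "4"}.get(prompt, "")


def candidate_agent(prompt: str) -> str:
    return {"Capital of France?": "Paris", "What is 2 + 2?": "5"}.get(prompt, "")
```

Comparing `pass_rate` for the baseline and the candidate gives the headline number, while `regressions` points at the specific cases that broke between iterations.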

How does this skill help with agent evaluation?

It provides a three-pillar framework covering Output Quality, Process/Trajectory Quality, and Trust & Safety, helping you choose between human, LLM-as-judge, or programmatic methods.
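One way to picture the framework is as an evaluation plan in which each check is tagged with a pillar and a method. The sketch below is an assumption about how such a plan might be represented, not the skill's own data model; `Check`, `uncovered_pillars`, and the example checks are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Set


class Pillar(Enum):
    OUTPUT_QUALITY = "Output Quality"
    PROCESS_QUALITY = "Process/Trajectory Quality"
    TRUST_AND_SAFETY = "Trust & Safety"


class Method(Enum):
    HUMAN = "human"
    LLM_AS_JUDGE = "LLM-as-judge"
    PROGRAMMATIC = "programmatic"


@dataclass(frozen=True)
class Check:
    """A single evaluation check (hypothetical structure for illustration)."""
    name: str
    pillar: Pillar
    method: Method


def uncovered_pillars(plan: List[Check]) -> Set[Pillar]:
    """Return the pillars that no check in the plan addresses."""
    return set(Pillar) - {check.pillar for check in plan}


# Example plan with made-up checks; the method assigned to each is illustrative only.
EXAMPLE_PLAN = [
    Check("answer matches reference", Pillar.OUTPUT_QUALITY, Method.PROGRAMMATIC),
    Check("tool calls follow a sensible order", Pillar.PROCESS_QUALITY, Method.LLM_AS_JUDGE),
]
```

Calling `uncovered_pillars(EXAMPLE_PLAN)` here flags Trust & Safety as the pillar still missing a check, which is the kind of gap the framework is meant to surface.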

Why use an evaluation-first approach for agents?

Creating evaluations before writing fixes prevents 'solving imaginary problems' and ensures that every change is backed by empirical data and regression testing.

AI Agent Engineering Advisor

Author: mberto10


Data Science & ML

Provides strategic guidance on building, evaluating, and iterating on AI agents using industry-standard frameworks.

This skill acts as a knowledgeable partner for engineering high-quality AI agents, helping developers navigate the complexities of evaluation strategies, dataset curation, and improvement cycles. Drawing from methodologies by Anthropic, Google, and Manus, it guides users through establishing performance baselines, implementing LLM-as-judge patterns, and leveraging Langfuse traces to create robust 'Golden Datasets' for regression testing and continuous optimization.

Key Features

01 Integration patterns for Langfuse-based monitoring and experiment running

02 Evaluation-first improvement methodology to prevent solving imaginary problems

03 Golden Dataset curation strategies using synthetic data and production traces

04 Tailored guidance for specific agent types like Web Research, RAG, and Support

05 Strategic evaluation framework based on Output, Process, and Trust pillars

Use Cases

01 Defining a robust evaluation strategy for a new multi-step AI agent

02 Troubleshooting agent failure modes and designing targeted improvement cycles

03 Curating a high-quality test dataset from existing production logs in Langfuse


Install with 🐟 Skill.Fish

npx skillfish add mberto10/mberto-compound agent-advisor
