Can this skill help with LLM-as-judge setup?

Yes, it includes specialized workflows for designing rubrics, selecting judge models, and implementing bias mitigation strategies like pairwise comparison and randomized ordering.
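As a rough illustration of the bias mitigation mentioned above, the sketch below randomizes the order in which two answers are shown to a judge and then maps the positional verdict back to the original labels. The `Judge` signature, `pairwise_compare`, and the stand-in `longer_answer_judge` are hypothetical names invented for this example and are not part of the skill; a real judge would call an LLM.

```python
import random
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> "first" | "second".
Judge = Callable[[str, str, str], str]


def pairwise_compare(prompt: str, answer_a: str, answer_b: str,
                     judge: Judge, rng: random.Random) -> str:
    """Return "A" or "B", shuffling presentation order to offset position bias."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(prompt, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # Map the positional verdict back to the original labels.
    if verdict == "first":
        return "B" if swapped else "A"
    return "A" if swapped else "B"


def longer_answer_judge(prompt: str, first: str, second: str) -> str:
    """Stand-in judge for demonstration only: prefers the longer answer."""
    return "first" if len(first) >= len(second) else "second"


rng = random.Random(0)  # seeded so the shuffle is reproducible
wins = [
    pairwise_compare("Summarize the ticket.", "short", "a longer answer",
                     longer_answer_judge, rng)
    for _ in range(10)
]
print(wins)
```

Because the stand-in judge looks only at content, answer B wins every trial regardless of presentation order. A judge that always picked the first slot would instead split its verdicts between A and B, which is how the shuffle exposes position bias.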

What types of metrics does it support?

It supports a three-tier metrics matrix: primary success metrics (binary or scaled), constraints (latency, cost, safety), and secondary quality metrics such as readability.
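One way to picture such a matrix is as a small table of metric definitions with thresholds on the constraints. The following is a minimal sketch under assumed names and values: `Metric`, `METRICS`, `violated_constraints`, and every threshold are hypothetical and do not come from the skill itself.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class Metric:
    name: str
    role: Literal["primary", "constraint", "secondary"]
    scale: Literal["binary", "scale", "numeric"]
    threshold: Optional[float] = None  # only meaningful for constraints
    higher_is_better: bool = True


# Hypothetical matrix; names and thresholds are placeholders.
METRICS = [
    Metric("task_success", "primary", "binary"),
    Metric("p95_latency_s", "constraint", "numeric", threshold=5.0, higher_is_better=False),
    Metric("cost_per_run_usd", "constraint", "numeric", threshold=0.05, higher_is_better=False),
    Metric("safety_pass_rate", "constraint", "numeric", threshold=0.99),
    Metric("readability", "secondary", "scale"),
]


def violated_constraints(results: dict[str, float]) -> list[str]:
    """Return the names of constraint metrics whose threshold is not met."""
    violations = []
    for metric in METRICS:
        if metric.role != "constraint" or metric.threshold is None:
            continue
        value = results[metric.name]
        ok = value >= metric.threshold if metric.higher_is_better else value <= metric.threshold
        if not ok:
            violations.append(metric.name)
    return violations


run = {
    "task_success": 0.82,
    "p95_latency_s": 6.1,
    "cost_per_run_usd": 0.03,
    "safety_pass_rate": 0.995,
    "readability": 4.1,
}
print(violated_constraints(run))  # ['p95_latency_s']
```

Treating constraints as pass/fail gates, separate from the primary metric, keeps a gain in task success from hiding a latency or safety violation.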

How does it integrate with development tools?

The skill is designed to work with Langfuse for dataset and prompt management, and it produces standardized specs that can be inserted into optimization journals.

What is the primary goal of the Evaluation Design skill?

The skill aims to create a repeatable, measurable framework for testing AI agents, ensuring performance improvements are quantifiable and regressions are avoided during optimization.
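To make "quantifiable improvements" and "avoided regressions" concrete, here is a minimal sketch that compares per-case pass/fail results from a baseline run and a candidate run on the same dataset. The function `compare_runs` and the case IDs are hypothetical and only illustrate the idea.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare two runs on their shared cases and list regressions and fixes."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("runs share no test cases")
    base_rate = sum(baseline[c] for c in shared) / len(shared)
    cand_rate = sum(candidate[c] for c in shared) / len(shared)
    return {
        "baseline_pass_rate": base_rate,
        "candidate_pass_rate": cand_rate,
        "delta": cand_rate - base_rate,
        # Cases that passed before and fail now.
        "regressed": [c for c in shared if baseline[c] and not candidate[c]],
        # Cases that failed before and pass now.
        "fixed": [c for c in shared if not baseline[c] and candidate[c]],
    }


baseline = {"case-1": True, "case-2": False, "case-3": True, "case-4": False}
candidate = {"case-1": True, "case-2": True, "case-3": False, "case-4": True}
report = compare_runs(baseline, candidate)
print(report["delta"], report["regressed"], report["fixed"])
# 0.25 ['case-3'] ['case-2', 'case-4']
```

Here the aggregate pass rate improves by 0.25, yet `case-3` regresses, which is why a per-case view is useful alongside the headline number.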

Agent Evaluation Design

Name: Agent Evaluation Design
Author: mberto10

Data Science & ML

Architects comprehensive evaluation frameworks for AI agents by defining metrics, datasets, and grading strategies.

This skill provides a systematic approach to designing robust evaluation plans for AI agents prior to optimization. It guides developers through a structured workflow to establish primary and constraint metrics, select representative datasets, and implement automated or human-in-the-loop grading rubrics. By ensuring evaluations are stable and measurable, it allows for high-confidence iterations and prevents regressions during the agent development lifecycle, ultimately providing a ready-to-run evaluation spec for optimization journals.

Key Features

01 Standardized evaluation specs for optimization loop integration

02 Rubric-driven grading workflows for LLM-as-judge and hybrid evaluators

03 Dataset strategy design covering production traces and synthetic cases

04 1 GitHub star

05 Structured metrics matrix for primary, constraint, and secondary goals

06 Deep integration with Langfuse for dataset and prompt management

Use Cases

01 Designing the evaluation plan for a customer support triage agent

02 Creating calibration plans to align LLM judges with human experts

03 Setting up safety and latency constraints for a production RAG system
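For the calibration use case above, one common way to check whether an LLM judge agrees with human experts is Cohen's kappa, which corrects raw agreement for chance. The listing does not say which agreement statistic the skill uses, so treat this as an assumed choice. The labels below are made-up sample data.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = h_counts.keys() | j_counts.keys()
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(h_counts[label] * j_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up sample labels for illustration only.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]
kappa = cohens_kappa(human_labels, judge_labels)
print(round(kappa, 3))  # 0.467
```

Raw agreement here is 75%, but kappa is only about 0.47 because both raters say "pass" most of the time. In a case like this, a calibration plan might tighten the rubric and then re-measure.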


Install with 🐟 Skill.Fish

npx skillfish add mberto10/mberto-compound evaluation-design

For use in Claude.ai and ChatGPT
