About
Evaluation Design is a systematic skill for Claude Code that enables developers to build rigorous testing frameworks for AI agents and LLM applications. It guides users through a decision-driven workflow to create a comprehensive metrics matrix, curate datasets from production traffic or synthetic sources, and implement automated grading systems such as LLM-as-judge. By establishing a clear evaluation specification before starting optimization loops, this skill ensures that agent performance improvements are measurable, reproducible, and aligned with production requirements.
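To make the LLM-as-judge idea concrete, here is a minimal sketch of the grading loop's shape: a rubric prompt is assembled for the judge model, and a numeric score is parsed from its reply. All names, the rubric wording, and the reply format are illustrative assumptions, not the skill's actual implementation.

```python
import re

# Hypothetical rubric; a real one would be tailored to the task under test.
RUBRIC = (
    "Score the candidate answer from 1 to 5:\n"
    "5 = fully correct and complete\n"
    "3 = partially correct\n"
    "1 = incorrect or off-topic\n"
    "Reply with a line of the form 'Score: <n>' and a short justification."
)

def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble the prompt that would be sent to the judge model."""
    return f"{RUBRIC}\n\nQuestion: {question}\nCandidate answer: {answer}"

def parse_judge_score(judge_reply: str) -> int:
    """Extract the 1-5 score from the judge's reply; raise if it is missing."""
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    if match is None:
        raise ValueError("judge reply did not contain a score")
    return int(match.group(1))

# In practice build_judge_prompt's output would be sent to a judge model via
# an API client; here a canned reply stands in to show the parsing step.
reply = "Score: 4\nCorrect overall, but omits one edge case."
print(parse_judge_score(reply))  # prints 4
```

An automated harness would run this pair over every case in the evaluation dataset and aggregate the scores into the metrics matrix the specification defines.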