About
Evaluation Design is a systematic skill for Claude Code that enables developers to build rigorous testing frameworks for AI agents and LLM applications. It guides users through a decision-driven workflow to create a comprehensive metrics matrix, curate datasets from production traffic or synthetic sources, and implement automated grading systems such as LLM-as-judge. By establishing a clear evaluation specification before starting optimization loops, this skill ensures that agent performance improvements are measurable, reproducible, and aligned with production requirements.
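To make the LLM-as-judge idea concrete, here is a minimal sketch of the grading loop's shape: a rubric prompt is assembled for the judge model, and a numeric score is parsed from its reply. All names, the rubric wording, and the reply format are illustrative assumptions, not the skill's actual implementation.

```python
import re

# Hypothetical rubric; a real one would be tailored to the task under test.
RUBRIC = (
    "Score the candidate answer from 1 to 5:\n"
    "5 = fully correct and complete\n"
    "3 = partially correct\n"
    "1 = incorrect or off-topic\n"
    "Reply with a line of the form 'Score: <n>' and a short justification."
)

def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble the prompt that would be sent to the judge model."""
    return f"{RUBRIC}\n\nQuestion: {question}\nCandidate answer: {answer}"

def parse_judge_score(judge_reply: str) -> int:
    """Extract the 1-5 score from the judge's reply; raise if it is missing."""
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    if match is None:
        raise ValueError("judge reply did not contain a score")
    return int(match.group(1))

# In practice build_judge_prompt's output would be sent to a judge model via
# an API client; here a canned reply stands in to show the parsing step.
reply = "Score: 4\nCorrect overall, but omits one edge case."
print(parse_judge_score(reply))  # prints 4
```

An automated harness would run this pair over every case in the evaluation dataset and aggregate the scores into the metrics matrix the specification defines.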