Overview
This skill enables Claude to build evaluation suites that move beyond vanity metrics to provide actionable insight into AI performance. It covers the full spectrum of evaluation strategies, including deterministic code-based graders, model-based scoring, and human-in-the-loop patterns. By implementing metrics such as pass@k, F1 scores, and iterative recovery rates (the Ralph Pattern), users can rigorously validate agent behavior across coding, research, and computer-use tasks. Whether you are benchmarking MCP servers or optimizing multi-agent coordination, this skill provides a roadmap for building robust, balanced test sets that drive iterative improvement.
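Of the metrics listed above, pass@k is the one most often implemented incorrectly. The standard unbiased estimator (introduced with the Codex evaluation work) computes, from n total samples of which c passed, the probability that at least one of k randomly drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated per task
    c: number of those samples that passed the grader
    k: budget of attempts being evaluated
    """
    if n - c < k:
        # Fewer failures than draws: at least one draw must be a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per task, 3 passed the grader.
print(pass_at_k(10, 3, 1))  # ≈ 0.3 (matches the raw pass rate at k=1)
print(pass_at_k(10, 3, 5))  # higher: 5 attempts give more chances to pass
```

Note the naive alternative, 1 - (1 - c/n)**k, is biased for small n; the combinatorial form above is the one to use when reporting results.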