Overview
This skill provides a comprehensive toolkit for building, grading, and benchmarking AI agents within Claude Code. It streamlines the creation of evaluation suites using Anthropic's proven patterns for various agent types, including coding, conversational, and research agents. By offering templates for code-based, model-based, and human graders, as well as specific metrics like pass@k and pass^k, it helps developers transition from initial capability testing to robust regression suites, ensuring high-quality and reliable agentic workflows.