Overview
The Eval Harness skill brings Eval-Driven Development (EDD) to Claude Code, treating evaluations as the unit tests of AI development. Developers define success criteria before implementation, run continuous regression checks, and measure reliability with metrics such as pass@k (the probability that at least one of k attempts succeeds). By combining code-based, model-based, and human graders with standardized reporting and a performance history, the skill gives AI-assisted development a rigorous structure and helps maintain confidence in each code change.
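To make the pass@k metric concrete, here is a minimal sketch of one common way to estimate it from repeated runs of a single eval: the unbiased combinatorial estimator 1 - C(n-c, k) / C(n, k), where n is the number of attempts and c the number that passed. The function name `pass_at_k` and its parameters are illustrative assumptions, not part of the skill itself, and the skill may compute or report pass@k differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n attempts of which c passed.

    Hypothetical helper for illustration only; not the skill's API.
    Returns the probability that at least one of k attempts, drawn
    without replacement from the n recorded attempts, is a pass.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the ways to pick k attempts that all failed;
    # it is 0 when fewer than k attempts failed, which yields 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 recorded attempts of which 3 passed, `pass_at_k(10, 3, 1)` evaluates to 0.3 (up to floating-point rounding), and the estimate rises as k grows because more attempts give more chances for at least one pass.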