Overview
The Eval Harness skill brings Eval-Driven Development (EDD) to your Claude Code workflow, treating evaluations as the unit tests of AI development. It lets you define success criteria before implementation, run capability and regression evals continuously, and measure results with metrics such as pass@k (the probability that at least one of k attempts succeeds). Its structured graders, which range from deterministic code checks to model-based and human review, help keep AI-driven changes reliable, verifiable, and free of regressions throughout the development lifecycle.
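
To make the pass@k metric concrete, here is a minimal Python sketch. It assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n attempts were sampled and c of them passed. The function name `pass_at_k` and its parameters are hypothetical and not taken from the skill itself, which may compute the metric differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k attempts passes.

    Hypothetical helper (not part of the skill): assumes n sampled
    attempts, of which c passed, and uses the estimator
    1 - C(n - c, k) / C(n, k).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, so any
    # draw of k attempts is then guaranteed to include a passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 attempts of which 3 passed, `pass_at_k(10, 3, 1)` is approximately 0.3, and the estimate rises as k grows because a larger draw is more likely to include a passing attempt.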