Overview
Eval Harness brings Eval-Driven Development (EDD) to your AI-assisted workflow by treating evaluations as the unit tests of AI development. Developers define expected behaviors before implementation, run capability and regression evals continuously, and measure success with metrics such as pass@k. The skill pairs deterministic code-based graders with model-based qualitative assessments to help keep Claude's contributions reliable, functional, and free of regressions throughout the development lifecycle.
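To make the pass@k metric concrete, here is a minimal Python sketch. It assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), computed from n sampled attempts of which c passed. The function name `pass_at_k` and its parameters are hypothetical and not part of Eval Harness itself; the skill may define or compute the metric differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k attempts passes.

    Hypothetical helper (not an Eval Harness API).
    n: total attempts sampled for one eval
    c: number of those attempts that passed
    k: attempt budget being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failing attempts, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets made up entirely of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, 3 passing attempts out of 10 give a pass@1 of roughly 0.3, and the estimate rises as k grows because more attempts are allowed per eval.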