About
Eval Harness brings the rigor of software testing to AI development by treating evaluations as the "unit tests" of Claude Code. It lets developers practice Eval-Driven Development (EDD): define expected behaviors before coding, track regressions against baseline SHA comparisons, and measure success with pass@k metrics. By combining deterministic code-based graders with model-based qualitative assessments, this skill helps ensure that AI-generated code meets defined capability standards and remains stable throughout the development lifecycle.
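
The pass@k metric mentioned above is commonly computed with the unbiased estimator from Chen et al.'s Codex paper: given n total attempts of which c passed, it gives the probability that at least one of k randomly drawn attempts passes. A minimal sketch (the function name is illustrative, not part of this skill's API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of
    k samples drawn (without replacement) from n attempts, of which
    c passed, is a passing attempt."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts, 4 passed.
print(pass_at_k(10, 4, 1))  # pass@1 = 1 - 6/10 = 0.4
```

Computing the complement via binomial coefficients avoids enumerating samples and stays numerically exact for small n.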