About
Eval Harness brings the rigor of software testing to AI development by treating evaluations as the "unit tests" of Claude Code. It lets developers practice Eval-Driven Development (EDD): define expected behaviors before coding, track regressions against baseline SHA comparisons, and measure success with pass@k metrics. By combining deterministic code-based graders with model-based qualitative assessments, this skill helps ensure that AI-generated code meets defined capability standards and remains stable throughout the development lifecycle.
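
The pass@k metric mentioned above is commonly computed with the unbiased estimator from Chen et al.'s Codex paper: given n total attempts of which c passed, it gives the probability that at least one of k randomly drawn attempts passes. A minimal sketch (the function name is illustrative, not part of this skill's API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of
    k samples drawn (without replacement) from n attempts, of which
    c passed, is a passing attempt."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts, 4 passed.
print(pass_at_k(10, 4, 1))  # pass@1 = 1 - 6/10 = 0.4
```

Computing the complement via binomial coefficients avoids enumerating samples and stays numerically exact for small n.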