Reliability Metrics: Built-in tracking for pass@k and pass^k to statistically measure AI performance consistency.
Version-Controlled Evaluations: Stores all definitions and logs in a dedicated .claude/evals directory for project auditing.
Multi-Modal Grading: Supports deterministic Bash/Grep checks and qualitative LLM-based evaluation rubrics.
Capability Evals: Define specific success criteria for new features before implementation begins.
Regression Testing: Automated baseline comparisons to ensure new changes don't break existing functionality.
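The two reliability metrics above measure opposite things: pass@k estimates the chance that at least one of k sampled attempts succeeds, while pass^k estimates the chance that all k succeed. As a sketch of how they are conventionally computed (these are the standard estimators, not necessarily this tool's internal code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n attempts is correct,
    given c observed successes."""
    if n - c < k:
        return 1.0  # too few failures for any k-sample to be all-fail
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """pass^k: probability that ALL of k independent samples pass,
    estimated from the observed success rate c/n."""
    return (c / n) ** k

# Example: 10 attempts, 7 passed, evaluated at k=3
print(round(pass_at_k(10, 7, 3), 4))   # -> 0.9917
print(round(pass_hat_k(10, 7, 3), 4))  # -> 0.343
```

Note how pass^k is the stricter metric: a model with a 70% per-attempt success rate almost always produces one good answer in three tries, but succeeds on all three only about a third of the time.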
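A baseline comparison for regression testing can be as simple as diffing current pass rates against a stored snapshot. The sketch below is a hypothetical illustration; the file layout, function name, and tolerance threshold are assumptions, not this tool's actual format:

```python
import json
from pathlib import Path

def find_regressions(baseline_path: str, current: dict[str, float],
                     tolerance: float = 0.05) -> list[str]:
    """Return names of evals whose pass rate dropped more than
    `tolerance` below the stored baseline. (Illustrative only:
    assumes a flat JSON file mapping eval name -> pass rate.)"""
    baseline = json.loads(Path(baseline_path).read_text())
    return [name for name, rate in current.items()
            if name in baseline and baseline[name] - rate > tolerance]
```

A CI step would store the baseline file after each accepted change and fail the build whenever `find_regressions` returns a non-empty list.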