What is Eval-Driven Development (EDD)?

EDD is a methodology where you define success criteria and evaluations before writing code, treating evals as 'unit tests for AI' to ensure consistent behavior.

How does Eval Harness measure reliability?

It uses two metrics: pass@k tracks how often a model succeeds at least once within k attempts, and pass^k tracks how often it succeeds on all k attempts in a row.
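As a rough illustration of the two metrics, the sketch below estimates both from a batch of recorded attempts. It assumes pass@k means "at least one of k attempts passes" and pass^k means "all k attempts pass"; the function names `pass_at_k` and `pass_pow_k` are hypothetical and are not taken from Eval Harness itself, which may compute these values differently.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = attempts recorded, c = attempts that passed, k = sample size.
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled attempts passes."""
    _check(n, c, k)
    # 1 minus the chance that all k sampled attempts come from the failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that all k sampled attempts pass."""
    _check(n, c, k)
    # Chance that all k sampled attempts come from the successes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical run: 10 attempts, 7 passed.
    print(f"pass@3 = {pass_at_k(10, 7, 3):.3f}")
    print(f"pass^3 = {pass_pow_k(10, 7, 3):.3f}")
```

With these definitions pass@k can only rise as k grows and pass^k can only fall, so reporting both gives an optimistic and a strict view of the same set of attempts.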

Does it support manual review?

Yes, it includes specific markers for 'Human Review Required' for security-critical or subjective tasks that cannot be fully automated.

Where are the evaluation files stored?

Evaluations are stored as version-controlled artifacts within your project under the .claude/evals directory.

Can I use this for existing projects?

Yes, you can define regression evals for your existing codebase to ensure Claude doesn't introduce bugs during refactoring or feature additions.
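One way to picture a regression check is to compare a saved baseline of eval outcomes against the current run. The sketch below is illustrative only: the result format (a mapping from eval name to pass/fail) and the function `find_regressions` are assumptions, not the format or API that Eval Harness actually uses.

```python
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return evals that passed in the baseline but fail or are missing now."""
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )


if __name__ == "__main__":
    # Hypothetical eval names and outcomes, before and after a refactor.
    baseline = {"login-flow": True, "csv-export": True, "search": False}
    current = {"login-flow": True, "csv-export": False, "search": True}
    print(find_regressions(baseline, current))
```

Treating a missing eval as a regression is a deliberate choice in this sketch: an eval that was silently deleted should not count as still passing.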

Eval Harness

Name: Eval Harness
Author: arabicapp


Security & Testing

Implements Eval-Driven Development (EDD) principles within Claude Code to ensure reliable, regression-free AI-generated code.

Eval Harness is a specialized framework designed to bring rigor to AI development by treating evaluations as unit tests for Claude. It allows developers to define expected behaviors before implementation, track regressions across sessions, and measure reliability using industry-standard metrics like pass@k. By integrating automated code-based graders with model-based qualitative evaluation, it helps teams build more robust AI-assisted workflows and ensures that new features don't break existing functionality.

Key Features

01 Automates the generation of comprehensive evaluation reports and logs

02 1 GitHub star

03 Implements Eval-Driven Development (EDD) for structured AI coding

04 Calculates pass@k and pass^k reliability metrics for performance tracking

05 Supports Capability and Regression evaluations to track progress

06 Combines deterministic code-based graders with LLM-based qualitative grading

Use Cases

01 Measuring the reliability of different prompts or agent strategies using pass@k metrics

02 Defining success criteria before prompting Claude for complex feature implementations

03 Running automated regression tests to ensure new code doesn't break legacy features

Install with 🐟 Skill.Fish

npx skillfish add arabicapp/everything-claude-code eval-harness

For use in Claude.ai and ChatGPT
