Where are the evaluation files stored?

All evaluation definitions, run history logs, and regression baselines are stored locally in your project under the .claude/evals/ directory.

What is Eval-Driven Development (EDD)?

EDD is a methodology where you define the expected behavior and success criteria for an AI task before implementation, ensuring the model's output meets specific standards.

What types of graders does the Eval Harness support?

It supports deterministic code-based graders using scripts, model-based graders that use Claude to evaluate open-ended output, and manual human review flags.
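As an illustration of the deterministic, code-based kind of grader described above, here is a minimal sketch in Python. It assumes a grader is a command whose exit code decides pass or fail; `GradeResult` and `grade_with_command` are hypothetical names chosen for this example and are not part of the skill's documented interface.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class GradeResult:
    """Outcome of one grader run (hypothetical structure for this sketch)."""
    passed: bool
    detail: str


def grade_with_command(command: list[str], timeout_s: float = 60.0) -> GradeResult:
    """Run a check command and treat exit code 0 as a pass.

    Assumption: the grader is any script or test command that signals
    success through its exit status.
    """
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return GradeResult(False, f"timed out after {timeout_s}s")
    except FileNotFoundError:
        return GradeResult(False, f"command not found: {command[0]}")
    # Keep the combined output so a failing eval can be inspected later.
    detail = (proc.stdout + proc.stderr).strip()
    return GradeResult(proc.returncode == 0, detail)


if __name__ == "__main__":
    # A trivial stand-in for a real check script.
    result = grade_with_command([sys.executable, "-c", "print('ok')"])
    print(result.passed, result.detail)
```

Because the verdict depends only on the exit code, the same input always grades the same way, which is what separates this kind of grader from model-based or human review.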

Can I use this for existing projects?

Yes, you can define regression evals for your existing codebase to ensure that AI-assisted changes do not break legacy functionality.

How does this skill track AI reliability?

It uses two metrics: pass@k, the probability that at least one of k attempts succeeds, and pass^k, which requires every one of the k attempts to succeed and is intended for critical paths.
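The two metrics can be estimated from a batch of recorded attempts. The sketch below uses a common combinatorial estimator: given n attempts of which c passed, it estimates the chance that a random set of k attempts contains at least one pass (pass@k) or consists only of passes (pass^k). The function names are hypothetical, and the skill itself may compute these values differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts passes) from c passes in n attempts."""
    _validate(n, c, k)
    # comb(n - c, k) counts the k-subsets made up entirely of failures;
    # it is 0 when there are fewer than k failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts pass) from c passes in n attempts."""
    _validate(n, c, k)
    # comb(c, k) counts the k-subsets made up entirely of passes;
    # it is 0 when there are fewer than k passes.
    return comb(c, k) / comb(n, k)


def _validate(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


if __name__ == "__main__":
    # Example: 4 recorded attempts, 2 of which passed, evaluated at k = 2.
    print(pass_at_k(4, 2, 2))
    print(pass_pow_k(4, 2, 2))
```

For the same attempt history, pass^k is never higher than pass@k, which is why it suits paths where a single failure is unacceptable.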

Eval Harness for Claude Code

Name: Eval Harness for Claude Code
Author: Aventerica89

Security & Testing

Implements an eval-driven development framework to define, track, and measure the reliability of AI-generated code.

The Eval Harness skill brings Eval-Driven Development (EDD) principles to your Claude Code workflow, treating evaluations as the unit tests of AI development. It lets you define success criteria before implementation, run continuous capability and regression tests, and measure performance with metrics such as pass@k. With structured graders that range from deterministic code checks to model-based and human reviews, the skill helps keep AI-driven changes reliable, verifiable, and free of regressions throughout the development lifecycle.

Key Features

01 Automated capability and regression testing

02 Support for code-based, model-based, and human graders

03 Standardized eval reporting and historical logging

04 0 GitHub stars

05 Statistical reliability tracking with pass@k and pass^k metrics

06 Eval-Driven Development (EDD) workflow management

Use Cases

01 Verifying that refactors do not break existing logic via regression testing

02 Defining success criteria for new features before implementation

03 Measuring the reliability of complex AI prompts or logic changes over time

Install with 🐟 Skill.Fish

npx skillfish add aventerica89/bricks-builder-agent eval-harness

For use in Claude.ai and ChatGPT
