Can I use custom graders?

Yes, the skill supports deterministic code-based graders (like shell scripts), model-based graders using Claude's reasoning capabilities, and manual human review flags for high-risk changes.
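As a rough illustration of the deterministic, code-based kind of grader, the sketch below runs a command and treats exit code 0 as a pass. The function name `run_code_grader`, the timeout default, and the exit-code convention are assumptions made for this example; the listing does not show the skill's actual grader interface.

```python
import subprocess
import sys


def run_code_grader(command: list[str], timeout_s: float = 60.0) -> bool:
    """Run a grader command and report pass/fail.

    Assumption for this sketch: exit code 0 means the eval passed,
    and any other exit code (or a timeout) means it failed.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,  # keep grader output off the console
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A grader that hangs is treated as a failure, not an error.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in for a real grader script: a Python one-liner that exits 0.
    passed = run_code_grader([sys.executable, "-c", "raise SystemExit(0)"])
    print("PASS" if passed else "FAIL")
```

In practice the command would point at a shell script or test runner; the same wrapper works for any executable that signals success through its exit code.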

What are pass@k metrics?

Pass@k measures the probability that at least one successful result is generated within 'k' attempts, helping developers understand the statistical reliability of an AI's output for a specific task.
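To make the metric concrete, here is a minimal sketch of how pass@k can be estimated from recorded attempts, using the standard combinatorial estimator: with n attempts of which c passed, pass@k = 1 - C(n-c, k) / C(n, k). The companion function estimates pass^k under its commonly used definition (all k attempts succeed). The function names are hypothetical, and the listing does not say how the skill itself computes these values.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n recorded attempts, c of which passed.

    This is the chance that a random set of k of the n attempts
    contains at least one pass.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failures: every set of k attempts includes a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k: the chance that k attempts drawn from the n all passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) is 0 when c < k, so too few passes gives 0.0.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Example: 10 attempts, 3 passed.
    for k in (1, 3, 5):
        print(k, round(pass_at_k(10, 3, k), 3), round(pass_hat_k(10, 3, k), 3))
```

The two metrics pull in opposite directions as k grows: pass@k rises toward 1 (more chances to succeed once), while pass^k falls toward 0 (more chances to fail once), which is why the second is the stricter measure of consistency.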

How does this skill handle regression testing?

It allows you to define baseline tests that run against new changes, ensuring that added features don't break existing functionality, and reports a pass/fail status relative to previous checkpoints.
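The comparison against a previous checkpoint can be sketched as follows. This is an illustrative example only: the result format (a mapping from eval name to pass/fail) and the function name `find_regressions` are assumptions, not the skill's documented storage format or API.

```python
def find_regressions(
    baseline: dict[str, bool], current: dict[str, bool]
) -> list[str]:
    """Return the names of evals that passed at the baseline but do not pass now.

    Assumption for this sketch: an eval missing from the current run
    counts as a regression, since its behavior is no longer verified.
    """
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )


if __name__ == "__main__":
    # Hypothetical eval names and results.
    baseline = {"login": True, "export": True, "search": False}
    current = {"login": True, "export": False, "search": False}

    regressions = find_regressions(baseline, current)
    print("FAIL" if regressions else "PASS", regressions)
```

Evals that were already failing at the baseline (like `search` here) are not reported, so the check isolates breakage introduced by the new change.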

What is Eval-Driven Development (EDD)?

EDD is a methodology where you define the expected behavior and success criteria for an AI task before writing the code, providing a clear benchmark for success similar to Test-Driven Development (TDD).

Where are the evaluation files stored?

All evaluations are stored directly in your project under the .claude/evals/ directory, allowing them to be versioned alongside your source code as first-class artifacts.

Eval Harness for Claude Code

Author: AmbushData


Security and Testing

Implements a formal Eval-Driven Development (EDD) framework to ensure AI-generated code meets quality standards and avoids regressions.

The Eval Harness skill brings Eval-Driven Development (EDD) to the Claude Code environment, treating evaluations as the 'unit tests' for AI development. It enables developers to define success criteria before implementation, run continuous regression checks, and measure reliability using metrics like pass@k. By integrating code-based, model-based, and human graders, this skill provides a rigorous structure for building reliable AI-assisted software while maintaining high confidence in every code change through standardized reporting and performance history.

Key Features

01 0 GitHub stars

02 Multi-modal grading (code-based, model-based, and human review)

03 Standardized eval reporting and history tracking

04 Eval-Driven Development (EDD) workflow integration

05 Advanced reliability metrics including pass@k and pass^k

06 Automated capability and regression testing

Use Cases

01 Preventing regressions in legacy codebases during AI-assisted refactoring

02 Verifying complex feature implementations before merging into production

03 Measuring the reliability and consistency of AI-generated logic over multiple iterations


Install with 🐟 Skill.Fish

npx skillfish add ambushdata/ai_setup eval-harness

For use in Claude.ai and ChatGPT