How does the pass@k metric work?

It measures the probability that at least one of k attempts succeeds, which makes it useful for gauging the reliability of non-deterministic AI outputs.
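As a rough illustration, the metric can be estimated from n recorded attempts of which c passed. The sketch below uses the commonly cited estimator 1 - C(n-c, k) / C(n, k). The function names are hypothetical and not taken from the harness, and reading pass^k as "all k attempts succeed" is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts passes) from n attempts with c passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Chance that k attempts drawn without replacement are all failures.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts pass); an assumed reading of pass^k."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Chance that k attempts drawn without replacement are all passes.
    return comb(c, k) / comb(n, k)
```

For example, with 3 passes out of 10 recorded attempts, pass_at_k(10, 3, 5) is about 0.92 while pass_pow_k(10, 3, 2) is about 0.07, which shows how the two metrics answer different reliability questions.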

Can I use this for existing projects?

Yes, you can define regression evals for your existing codebase to ensure that future changes made by Claude do not break established functionality.
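One way to picture a regression check is to compare the current run against a stored baseline and flag anything that used to pass. This is only a sketch: the function name and the name-to-boolean result format are assumptions, not the harness's actual data model.

```python
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return names of evals that passed in the baseline but fail or are missing now."""
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )
```

An eval that was already failing in the baseline is not reported, so the list contains only newly broken behavior.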

Does this require manual grading?

The harness supports manual review, but it prioritizes deterministic code-based graders (like shell scripts or test runners) and model-based graders for automated workflows.
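A deterministic code-based grader can be as simple as running a command and treating exit code 0 as a pass. The following is a minimal sketch under that assumption; grade_with_command and its timeout parameter are hypothetical names, not part of the harness.

```python
import subprocess
import sys


def grade_with_command(command: list[str], timeout_s: float = 60.0) -> bool:
    """Run a command (e.g. a test runner) and treat exit code 0 as a pass."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable command counts as a failed eval.
        return False
    return result.returncode == 0
```

Calling grade_with_command([sys.executable, "-m", "pytest", "-q"]) would, for instance, grade an attempt by whether the project's test suite passes, assuming pytest is installed in that environment.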

What is Eval-Driven Development (EDD)?

EDD treats AI evaluations like unit tests: developers define success criteria and expected behaviors before implementing code with an AI assistant, so the output can be checked for accuracy.

Where are the evaluation results stored?

Evals and their history are stored as first-class artifacts within your project repository under the .claude/evals/ directory, allowing for version control and tracking.

Eval-Driven Development Harness

Author: steven715


Security and Testing

Implements a formal evaluation framework for Claude Code sessions to enable reliable, test-driven AI development.

The Eval-Driven Development (EDD) Harness integrates rigorous testing principles directly into Claude Code sessions, treating AI-generated outputs as testable units. It allows developers to define success criteria before implementation, utilize various grader types—including deterministic code checks and model-based evaluations—and track critical reliability metrics like pass@k. By centralizing eval definitions and history within the project repository, this skill ensures that AI contributions are functional, high-quality, and free of regressions, making it essential for teams building complex software with AI assistance.

Key Features

01 Support for code-based, model-based, and human-in-the-loop graders

02 Integrated workflow for defining, checking, and reporting evals

03 Advanced reliability metrics including pass@k and pass^k calculations

04 Standardized capability and regression eval templates

05 1 GitHub star

06 Persistent storage of eval history and baselines in the repository

Use Cases

01 Measuring and improving the reliability of non-deterministic AI coding tasks

02 Defining success criteria before implementation to guide AI-driven feature development

03 Automating regression testing to ensure AI changes don't break existing functionality


Install with 🐟 Skill.Fish

npx skillfish add steven715/cpp-claude-code eval-harness

A downloadable version of the skill is also available for use in Claude.ai and ChatGPT.