How does this skill help with regression testing?

It allows you to set baselines for existing functionality and run automated checks during the development process to ensure new changes don't break established features.
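As an illustration of the baseline idea, the sketch below compares a stored set of eval results against a fresh run and reports anything that used to pass and no longer does. The function name, the eval names, and the dictionary layout are hypothetical; the listing does not show how the skill itself stores baselines.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Return the names of evals that passed in the baseline but do not pass now.

    An eval that is missing from the current run is treated as a regression.
    """
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )


# Hypothetical eval names and results, for illustration only.
baseline = {"login": True, "checkout": True, "search": False}
current = {"login": True, "checkout": False, "search": True}

print(find_regressions(baseline, current))  # ['checkout']
```

Only evals that passed in the baseline count, so a check that was already failing (here `search`) is never reported as a regression.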

What are pass@k and pass^k metrics?

pass@k measures the probability that at least one of k attempts succeeds, while pass^k measures the probability that all k attempts succeed. pass^k is the stricter metric, which makes it the better fit for critical paths where consistency matters.
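One common way to estimate both metrics is from n recorded attempts of which c succeeded, counting how many size-k subsets of those attempts contain at least one success (pass@k) or only successes (pass^k). The sketch below assumes that estimator; the listing does not say which formula the skill uses, and the function names are illustrative.

```python
from math import comb


def _check_args(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts succeeds) from c successes in n attempts."""
    _check_args(n, c, k)
    # Fraction of size-k subsets made up entirely of failures, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts succeed) from c successes in n attempts."""
    _check_args(n, c, k)
    # Fraction of size-k subsets made up entirely of successes.
    return comb(c, k) / comb(n, k)


# Example: 10 recorded attempts, 7 of them successful, k = 3.
print(round(pass_at_k(10, 7, 3), 3))   # 0.992
print(round(pass_pow_k(10, 7, 3), 3))  # 0.292
```

The same run history gives a high pass@3 and a low pass^3, which is why the two metrics answer different questions: "can it succeed?" versus "does it succeed every time?".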

What is Evaluation-Driven Development (EDD)?

EDD is a methodology in which expected behaviors and tests are defined before AI implementation begins. These evals play the role that unit tests play in conventional development: they state what "done" means up front and are rerun as the implementation changes.

Does it support automated code-based testing?

Yes, it includes code-based graders that use tools like grep, npm test, and build commands to provide deterministic pass/fail results.
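A code-based grader of this kind can be sketched as a function that runs a command and treats exit code 0 as a pass. This is a minimal illustration, not the skill's own implementation: the function names and check names are hypothetical, and the two demo commands simply call the Python interpreter so the example runs anywhere.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def run_code_grader(name: str, command: Sequence[str]) -> Tuple[str, bool]:
    """Run one command and return (name, passed), where exit code 0 means pass."""
    result = subprocess.run(command, capture_output=True, text=True)
    return name, result.returncode == 0


def grade(checks: Sequence[Tuple[str, Sequence[str]]]) -> List[Tuple[str, bool]]:
    """Run every check in order and collect the pass/fail results."""
    return [run_code_grader(name, command) for name, command in checks]


# Stand-in commands. In a real project these might instead be something like
# ["npm", "test"] or ["grep", "-q", "some-pattern", "some-file"] (hypothetical).
checks = [
    ("always-passes", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("always-fails", [sys.executable, "-c", "raise SystemExit(1)"]),
]

for name, passed in grade(checks):
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

Relying only on the exit code keeps the result deterministic: the same command on the same code gives the same verdict, with no model judgment involved.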

Eval Harness (EDD Framework)

Name: Eval Harness (EDD Framework)
Author: unju-ai

Security and Testing

Implements an Evaluation-Driven Development (EDD) framework for Claude Code to ensure reliable AI-generated code through systematic testing and regression tracking.

The Eval Harness skill introduces a formal evaluation framework for Claude Code sessions, shifting AI-assisted development toward Evaluation-Driven Development (EDD) principles. It enables developers to define rigorous success criteria before implementation, track functional regressions across iterations, and quantify AI performance using metrics like pass@k. By combining deterministic code-based graders with model-based and human-in-the-loop reviews, it provides a robust infrastructure for building production-ready AI features and verifying complex software changes with high confidence.

Key Features

01 Generates detailed evaluation reports and maintains project-level history

02 Supports three grading modes: code-based, model-based, and human review

03 Calculates reliability metrics such as pass@k and pass^k

04 Automates capability and regression evaluations

05 Supports Evaluation-Driven Development (EDD) pre-coding workflows

06 0 GitHub stars

Use Cases

01 Preventing breaking changes when refactoring complex legacy codebases

02 Measuring the reliability and consistency of multi-step AI development tasks

03 Defining functional requirements and success criteria before AI starts implementation

Install with 🐟 Skill.Fish

npx skillfish add unju-ai/ecc eval-harness

For use in Claude.ai and ChatGPT