How does the pass@k metric help my development workflow?

The pass@k metric estimates the probability that at least one of 'k' attempts succeeds, which lets you quantify how reliable and consistent AI-generated solutions are.
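As a rough illustration, pass@k can be estimated from a batch of recorded attempts. The sketch below uses the common combinatorial estimator (one minus the chance that a random subset of k attempts contains no pass). The function name `pass_at_k` and the parameters `n` (attempts recorded) and `c` (attempts that passed) are assumptions for this example, not part of the skill's documented interface.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n recorded attempts, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n attempts contains no passing attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical data: 10 attempts, 3 of them passed.
print(pass_at_k(n=10, c=3, k=1))  # chance a single attempt passes
print(pass_at_k(n=10, c=3, k=5))  # chance at least one of five passes
```

A low pass@1 combined with a high pass@5 suggests the task is solvable but not yet reliable on the first try.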

Does this skill support automated testing?

Yes. It includes code-based graders that use bash scripts, grep, and test runners to verify deterministically whether the AI's output meets specific technical requirements.
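A code-based grader of this kind can be sketched as a function that runs a command and treats exit status 0 as a pass. This is a minimal sketch under assumptions: `run_check`, `grade`, and the check names are hypothetical, and the commands here call the Python interpreter only so the example runs anywhere. In a real project the commands might be something like `["grep", "-q", "pattern", "file"]` or `["pytest", "-q"]`.

```python
import subprocess
import sys


def run_check(command: list[str], timeout: float = 30.0) -> bool:
    """Return True when the command exits with status 0."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # A missing executable or a hung command counts as a failure.
        return False
    return result.returncode == 0


def grade(checks: dict[str, list[str]]) -> dict[str, bool]:
    """Run every named check and collect pass/fail results."""
    return {name: run_check(cmd) for name, cmd in checks.items()}


# Stand-in commands (assumption): one that succeeds and one that fails.
checks = {
    "exits_cleanly": [sys.executable, "-c", "raise SystemExit(0)"],
    "detects_failure": [sys.executable, "-c", "raise SystemExit(1)"],
}
results = grade(checks)
print(results)
```

Because the verdict depends only on exit codes, the same checks give the same result on every run, which is what makes this style of grading deterministic.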

What is Eval-Driven Development (EDD)?

EDD is a methodology that treats AI evaluations like unit tests: expected behaviors and success criteria are defined before the AI begins implementation, so the result can be checked against them.
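The eval-first ordering can be illustrated with a small sketch: the success criteria are written down first, and a candidate implementation is only scored against them afterwards. Everything here (the `Eval` class, `run_evals`, and the `slugify` task) is a hypothetical example, not the skill's actual eval format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Eval:
    name: str
    check: Callable[[Callable[[str], str]], bool]


# Step 1: success criteria, written before any implementation exists.
EVALS = [
    Eval("lowercases", lambda f: f("Hello World") == "hello-world"),
    Eval("trims_whitespace", lambda f: f("  a b  ") == "a-b"),
    Eval("empty_input", lambda f: f("") == ""),
]


def run_evals(impl: Callable[[str], str]) -> dict[str, bool]:
    """Score an implementation; an exception counts as a failed eval."""
    results = {}
    for ev in EVALS:
        try:
            results[ev.name] = bool(ev.check(impl))
        except Exception:
            results[ev.name] = False
    return results


# Step 2: a candidate implementation, produced after the evals.
def slugify(text: str) -> str:
    return "-".join(text.lower().split())


print(run_evals(slugify))
```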

Can I use Eval Harness for qualitative code reviews?

Yes, the framework includes model-based graders where Claude evaluates qualitative aspects like code structure, error handling, and adherence to best practices.
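One way to picture a model-based grader is a rubric prompt plus a parser for the model's verdict. The sketch below is an assumption-heavy illustration: `build_prompt`, `parse_verdict`, `model_grade`, and the PASS/FAIL reply convention are invented for this example, and `fake_model` is a stub standing in for a real call to Claude.

```python
from typing import Callable

# Rubric items taken from the qualitative aspects named above.
RUBRIC = ["clear code structure", "error handling", "adherence to best practices"]


def build_prompt(code: str) -> str:
    """Assemble a review prompt that asks for a PASS/FAIL first line."""
    criteria = "\n".join(f"- {item}" for item in RUBRIC)
    return (
        "Review the code below against these criteria:\n"
        f"{criteria}\n"
        "Answer with PASS or FAIL on the first line, then a short reason.\n\n"
        f"{code}"
    )


def parse_verdict(reply: str) -> bool:
    """Treat the reply as a pass only if its first line is exactly PASS."""
    lines = reply.strip().splitlines()
    return bool(lines) and lines[0].strip().upper() == "PASS"


def model_grade(code: str, ask_model: Callable[[str], str]) -> bool:
    return parse_verdict(ask_model(build_prompt(code)))


def fake_model(prompt: str) -> str:
    # Stub (assumption): a real grader would send the prompt to a model.
    return "PASS\nStructure is clear and errors are handled."


print(model_grade("def add(a, b):\n    return a + b", fake_model))
```

Passing the model in as a callable keeps the grading logic testable without network access.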

Eval Harness for EDD

Author: ysyecust

Security & Testing

Implements an evaluation-driven development framework to measure and improve AI coding reliability through structured testing and metrics.

Eval Harness provides a formal framework for Claude Code users to adopt Eval-Driven Development (EDD), which treats evaluations as the unit tests of AI-assisted coding. By defining success criteria and capability benchmarks before implementation, developers can track progress using deterministic code-based graders, model-based assessments, and pass@k metrics. The skill is aimed at high-integrity systems: it lets users verify functional requirements, prevent regressions, and objectively measure the reliability of AI-generated code across complex development cycles.

Key Features

01 Multi-modal grading including code-based, model-based, and human review

02 2 GitHub stars

03 Capability and regression evaluation templates for structured testing

04 Automated pass@k and pass^k reliability metric tracking

05 Standardized evaluation reporting and historical session logging

06 Native project-level storage integration for persistent benchmarking
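The feature list names pass^k alongside pass@k without defining it. One common reading, assumed here, is that pass^k is the probability that all k attempts succeed, which makes it a stricter consistency measure. Under that assumption it can be estimated from recorded attempts as follows; `pass_hat_k`, `n`, and `c` are hypothetical names for this sketch.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k: the chance that k sampled attempts ALL pass.

    Assumes n recorded attempts, c of which passed.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Fraction of size-k subsets made up entirely of passing attempts.
    return comb(c, k) / comb(n, k)


# Hypothetical data: 10 attempts, 8 passed.
print(pass_hat_k(n=10, c=8, k=1))  # single-attempt success rate
print(pass_hat_k(n=10, c=8, k=3))  # chance three attempts in a row all pass
```

With 8 of 10 attempts passing, at least one success in three tries is close to certain, while three successes in a row is much less likely, so the two metrics answer different questions.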

Use Cases

01 Defining functional requirements as testable evaluations before starting a new feature

02 Measuring the reliability of Claude's output for mission-critical code paths across multiple attempts

03 Verifying that refactoring or optimizations do not introduce regressions in existing logic

Install with 🐟 Skill.Fish

npx skillfish add ysyecust/everything-claude-code eval-harness
