What is Evaluation-Driven Development (EDD)?

EDD treats evaluations as unit tests for AI, defining expected behavior before implementation and tracking reliability metrics throughout the development process.

What is the difference between code-based and model-based graders?

Code-based graders use scripts for objective pass/fail checks (such as whether the build succeeds), while model-based graders use Claude to evaluate qualitative aspects such as code structure and error handling.

Where are the evaluation definitions stored?

Evaluations are stored locally within your project in the .claude/evals/ directory, allowing them to be versioned alongside your source code.

How does this skill measure reliability?

It uses metrics like pass@1 (success on the first try) and pass@k (success within k attempts) to quantify how consistently the AI performs specific tasks.
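As a rough illustration of these metrics, the sketch below computes pass@k from recorded attempt outcomes. It is not the skill's own implementation: the function names and the sample data are hypothetical, and the reading of pass^k (mentioned in the skill description) as "all k attempts succeed" is an assumption.

```python
from typing import Callable, Sequence


def pass_at_k(attempts: Sequence[bool], k: int) -> bool:
    """True if at least one of the first k attempts succeeded."""
    if k < 1 or k > len(attempts):
        raise ValueError("k must be between 1 and the number of attempts")
    return any(attempts[:k])


def pass_hat_k(attempts: Sequence[bool], k: int) -> bool:
    """True if all of the first k attempts succeeded (assumed meaning of pass^k)."""
    if k < 1 or k > len(attempts):
        raise ValueError("k must be between 1 and the number of attempts")
    return all(attempts[:k])


def pass_rate(
    tasks: Sequence[Sequence[bool]],
    k: int,
    metric: Callable[[Sequence[bool], int], bool],
) -> float:
    """Fraction of tasks for which the given metric holds."""
    if not tasks:
        raise ValueError("at least one task is required")
    return sum(metric(attempts, k) for attempts in tasks) / len(tasks)


# Hypothetical outcomes: one inner list per task, one bool per attempt.
tasks = [
    [True, True, True],
    [False, True, True],
    [False, False, False],
    [True, False, True],
]

print(pass_rate(tasks, 1, pass_at_k))   # 0.5
print(pass_rate(tasks, 3, pass_at_k))   # 0.75
print(pass_rate(tasks, 3, pass_hat_k))  # 0.25
```

Under this reading, pass@k rises as k grows, since more attempts give more chances to succeed, while pass^k falls, which makes the second metric the stricter measure of consistency.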

Can I use existing test runners with this skill?

Yes, the framework integrates with command-line tools such as npm test or grep, so deterministic checks on your codebase can run automatically.
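A deterministic check of this kind usually comes down to running a command and reading its exit code. The sketch below shows one way to wrap that in Python. The CheckResult type and the run_check helper are hypothetical names for illustration and are not part of the skill.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_check(name: str, command: Sequence[str], timeout: float = 120.0) -> CheckResult:
    """Run a command and treat exit code 0 as a pass."""
    try:
        proc = subprocess.run(
            list(command),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing executable or a timeout counts as a failed check.
        return CheckResult(name, False, str(exc))
    return CheckResult(name, proc.returncode == 0, proc.stdout + proc.stderr)


# Demo using the current Python interpreter so the sketch runs anywhere.
demo = run_check("demo", [sys.executable, "-c", "print('ok')"])
print(demo.passed)  # True
```

In a real project, the same helper could be called as run_check("unit tests", ["npm", "test"]), or with a grep command whose exit code signals whether a pattern was found.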

Eval Harness for Claude Code

Name: Eval Harness for Claude Code
Author: oabdelmaksoud


Security & Testing

Implements Evaluation-Driven Development (EDD) to formalize and measure the reliability of AI-generated code through structured testing frameworks.

The eval-harness skill applies Evaluation-Driven Development (EDD) principles to Claude Code sessions. It lets developers define success criteria before coding, run deterministic code-based checks, and use model-based grading for qualitative outputs. By tracking metrics like pass@k and pass^k, it gives a quantitative measure of AI reliability and helps confirm that new features meet requirements without introducing regressions into existing functionality. The aim is to hold AI-assisted development to the same quality-assurance standards as traditional software engineering.

Key Features

01 Automated regression tracking to prevent breaks in existing functionality

02 AI-powered model grading for qualitative code review and structural analysis

03 Reliability measurement using pass@k and pass^k metrics

04 Pre-implementation evaluation definition to establish clear success criteria

05 0 GitHub stars

06 Deterministic code-based grading using grep, bash scripts, and test runners

Use Cases

01 Measuring and improving the success rate of Claude's code generation over multiple attempts

02 Automating regression testing during complex refactoring of existing codebases

03 Defining feature requirements as tests before starting the implementation phase


Install with 🐟 Skill.Fish

npx skillfish add oabdelmaksoud/superclaw eval-harness

For use in Claude.ai and ChatGPT
