Where are the evaluations stored?

Evaluations and logs are stored locally in the .claude/evals/ directory, allowing them to be version-controlled alongside your project source code.

Can I use this for security checks?

Yes, the framework supports deterministic code-based checkers and human review markers specifically designed for sensitive operations and security audits.

What is Eval-Driven Development (EDD)?

EDD is a methodology where you define AI evaluation criteria before implementation, treating them as unit tests for AI-generated code to ensure functional alignment.
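
A minimal sketch of that order of work, in Python. The names here (`EvalCase`, `run_evals`, `slugify`) are hypothetical illustrations, not part of the skill: the criteria are written first as plain data, and a candidate implementation is scored against them afterwards.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One success criterion, written before any implementation exists."""
    name: str
    input_text: str
    expected: str


# Step 1: criteria defined up front (hypothetical feature: URL slugs).
CASES = [
    EvalCase("lowercases", "Hello", "hello"),
    EvalCase("spaces become hyphens", "eval driven dev", "eval-driven-dev"),
    EvalCase("trims whitespace", "  edd  ", "edd"),
]


def run_evals(candidate: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Score a candidate implementation against every criterion."""
    return {case.name: candidate(case.input_text) == case.expected for case in cases}


# Step 2: the implementation, written (or generated) afterwards.
def slugify(text: str) -> str:
    return "-".join(text.strip().lower().split())


results = run_evals(slugify, CASES)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

Because the cases are data rather than code, the same list can be rerun unchanged against every later revision of the implementation.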

How does the pass@k metric work?

It measures the probability that the AI succeeds at least once within 'k' attempts, which is a standard way to track the reliability of non-deterministic AI outputs.
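
The metric can be estimated from recorded attempts. The sketch below is an assumption about how one might compute it, not the skill's own code: it uses the common combinatorial estimator for pass@k given `c` passing attempts out of `n`, and reads pass^k as the stricter probability that all `k` sampled attempts pass. The function names `pass_at_k` and `pass_hat_k` are hypothetical.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled attempts passes).

    n: attempts recorded, c: attempts that passed, k: sample size.
    """
    _check(n, c, k)
    # comb(n - c, k) counts the samples made only of failures;
    # it is 0 when fewer than k failures exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled attempts pass), the stricter pass^k reading."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


# Example: 3 passes recorded out of 10 attempts.
print(round(pass_at_k(10, 3, 3), 4))   # at least one of 3 passes
print(round(pass_hat_k(10, 3, 3), 4))  # all 3 pass
```

With the same pass rate, pass@k rises as `k` grows while pass^k falls, so the two together show both whether a task is achievable and whether it is consistent.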

What kind of graders does Eval Harness support?

It supports three types: Code-based graders for deterministic checks, Model-based graders for open-ended output evaluation, and Human graders for manual review.
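
To make the distinction concrete, here is a rough Python sketch of the three grader kinds. Everything in it is an illustrative assumption: the `Grade` record, the function names, and the `fake_judge` stand-in (which replaces a real model call) are hypothetical and do not reflect the skill's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Grade:
    grader: str
    passed: Optional[bool]  # None means "awaiting human review"
    note: str = ""


def code_grader(output: str) -> Grade:
    """Deterministic check: flag output that embeds a hard-coded key."""
    ok = "API_KEY=" not in output
    return Grade("code", ok, "" if ok else "possible hard-coded secret")


def model_grader(output: str, judge: Callable[[str], float], threshold: float = 0.7) -> Grade:
    """Open-ended check: a judge returns a score in [0, 1]."""
    score = judge(output)
    return Grade("model", score >= threshold, f"score={score:.2f}")


def human_grader(output: str) -> Grade:
    """Manual check: nothing is decided here, the item is only flagged."""
    return Grade("human", None, "flagged for manual review")


def fake_judge(text: str) -> float:
    """Stand-in for a model call: rewards the presence of a docstring."""
    return 0.9 if '"""' in text else 0.2


sample = 'def add(a, b):\n    """Return a + b."""\n    return a + b\n'
grades = [code_grader(sample), model_grader(sample, fake_judge), human_grader(sample)]
for g in grades:
    print(g.grader, g.passed, g.note)
```

Returning `None` from the human grader keeps "not yet reviewed" distinct from "failed", which matters when a report aggregates results from all three kinds.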

Eval-Driven Development (EDD) Harness

Author: oabdelmaksoud

Security & Testing

Implements a formal evaluation framework for Claude Code sessions based on Eval-Driven Development principles to ensure reliable AI-generated code.

The Eval Harness skill brings Eval-Driven Development (EDD) to your Claude Code workflow, treating evaluations as unit tests for AI behavior. It lets developers define success criteria before implementation, track regressions across sessions, and measure reliability with metrics such as pass@k. With code-based, model-based, and human grading, the skill helps verify that AI-driven features meet their functional and security requirements without destabilizing the existing codebase.

Key Features

01 Integrated Workflow: Standardized commands for defining, checking, and reporting on eval status during development.

02 Capability Evaluation: Define and test new AI-driven features with specific success criteria.

03 Multi-modal Grading: Support for deterministic code checks, model-based evaluations, and human-in-the-loop reviews.

04 Regression Testing: Track performance against baselines to prevent feature degradation.

05 Reliability Metrics: Measure success using pass@k and pass^k indicators for robust performance tracking.
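
The regression-testing idea above can be sketched as a comparison of current pass rates against a stored baseline. This is an illustration under assumptions: the function name `find_regressions`, the eval names, the pass rates, and the 0.05 tolerance are all hypothetical, and the skill's own baseline format is not shown on this page.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> list[str]:
    """Return the evals whose pass rate fell more than `tolerance` below baseline."""
    regressions = []
    for name, base_rate in baseline.items():
        rate = current.get(name, 0.0)  # an eval missing from the run counts as failing
        if rate < base_rate - tolerance:
            regressions.append(name)
    return sorted(regressions)


# Hypothetical pass rates per eval, before and after a refactor.
baseline = {"login-flow": 1.0, "csv-export": 0.9, "search": 0.8}
current = {"login-flow": 1.0, "csv-export": 0.7, "search": 0.78}

regressed = find_regressions(baseline, current)
print(regressed)
```

The tolerance absorbs run-to-run noise from non-deterministic output, so a small dip such as the one for `search` is not reported while a larger drop is.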

Use Cases

01 Defining success criteria for complex features before Claude begins implementation.

02 Measuring the reliability of Claude's output over multiple iterations to ensure consistent production quality.

03 Protecting critical code paths from regressions when performing large-scale AI refactoring.

Install with 🐟 Skill.Fish

npx skillfish add oabdelmaksoud/agention eval-harness
