What is Eval-Driven Development (EDD)?

EDD is a methodology where expected AI behavior is defined through formal evaluations before implementation begins, treating these evals as the unit tests for AI-generated code.

Does this skill support manual human review?

Yes, it includes a Human Grader type designed to flag specific changes for manual review, which is recommended for high-risk areas like security or UI/UX.

How does the code-based grader work?

The code-based grader runs deterministic shell commands, such as grep for pattern checks or a test runner like npm test, to verify whether the AI-generated code meets specific requirements.
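The idea of a deterministic, command-based check can be sketched in a few lines. This is a minimal illustration, not the skill's actual implementation: the function names `run_check` and `grade` are hypothetical, and the sketch assumes a check passes when its command exits with status 0.

```python
import subprocess


def run_check(command: list[str]) -> bool:
    """Return True when the command exits with status 0 (assumed to mean PASS)."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        # The executable is not installed; treat the check as failed.
        return False
    return result.returncode == 0


def grade(checks: dict[str, list[str]]) -> dict[str, bool]:
    """Run every named check and map its name to True (pass) or False (fail)."""
    return {name: run_check(command) for name, command in checks.items()}
```

A call such as `grade({"tests_pass": ["npm", "test"]})` would then report whether the test suite succeeded. Because each check is just an exit code, the same input always yields the same verdict, which is what makes this grader type deterministic.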

Where are evaluation results stored?

Eval definitions and run history are stored locally in the project's .claude/evals/ directory, so they can be versioned alongside the source code.
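As a rough illustration of keeping eval history inside the project, the sketch below appends one JSON record per run under .claude/evals/. Only the directory comes from the description above; the file name `history.jsonl`, the record fields, and the function `record_result` are assumptions made for this example, and the skill's real file layout may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_result(project_root: Path, eval_name: str, passed: bool) -> Path:
    """Append one run record to a log file under .claude/evals/ and return its path."""
    evals_dir = project_root / ".claude" / "evals"
    evals_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical file name and record shape, chosen for this sketch only.
    log_path = evals_dir / "history.jsonl"
    entry = {
        "eval": eval_name,
        "passed": passed,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with log_path.open("a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(entry) + "\n")
    return log_path
```

An append-only text file of this kind diffs cleanly, so committing it gives a reviewable record of how results changed over time.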

What is the difference between pass@1 and pass@k?

pass@1 measures the success rate on the first attempt, while pass@k measures the probability of at least one success within k attempts. Because model output varies from run to run, pass@k shows whether a task is achievable within a retry budget, which a single attempt cannot show.
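These metrics can be estimated from repeated trials. The sketch below assumes n recorded attempts of which c passed, and uses the standard combinatorial estimator for pass@k; it also shows pass^k under the common reading "all k attempts succeed". The function names are hypothetical, and the skill itself may compute these figures differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts passes) from n trials with c passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # 1 minus the chance that a random k-subset of the trials contains no pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts pass), assuming that is what pass^k denotes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Chance that a random k-subset of the trials consists only of passes.
    return comb(c, k) / comb(n, k)
```

With k = 1 both functions reduce to the plain success rate c / n. As k grows, pass@k rises toward 1 while pass^k falls toward 0, so the first suits tasks where one good result is enough and the second suits tasks that must succeed every time.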

Eval Harness

Author: ysyecust

Security and Testing

Implements eval-driven development principles to systematically test, track, and validate AI-generated code quality.

Eval Harness is a specialized framework for Claude Code that adopts Eval-Driven Development (EDD), treating evaluations as the 'unit tests' of AI-assisted software engineering. It enables developers to define rigorous success criteria before implementation, perform deterministic code-based grading, and leverage model-based or human-led reviews to ensure high reliability. By tracking metrics like pass@k and monitoring for regressions across sessions, this skill helps developers build more predictable and robust AI-generated features while maintaining a historical record of system performance within the project directory.

Key Features

01 Automated evaluation reporting with status summaries

02 Reliability tracking using pass@k and pass^k metrics

03 Multiple grader types: code-based, model-based, and human-led reviews

04 Local storage of eval definitions and history within the .claude directory

05 Standardized templates for Capability and Regression evaluations

Use Cases

01 Implementing a 'test-first' workflow specifically optimized for LLM agent interactions

02 Ensuring new AI-generated features do not break existing functionality in complex codebases

03 Measuring the success rate of complex refactoring tasks through repeated trials

Install with 🐟 Skill.Fish

npx skillfish add ysyecust/everything-claude-code eval-harness
