What is the difference between pass@k and pass^k?

pass@k measures whether an AI succeeds at least once in k attempts, while pass^k requires it to succeed in every one of the k trials, which makes pass^k the stricter reliability threshold.
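As a rough illustration of the two definitions, the sketch below computes both metrics from a list of trial outcomes, and also shows how they diverge for a given per-attempt success rate. The function names are hypothetical and are not part of Eval Harness. The two rate-based helpers assume trials are independent, which may not hold in practice.

```python
from typing import Sequence


def pass_at_k(results: Sequence[bool]) -> bool:
    """pass@k: True if at least one of the k trial results is a success."""
    return any(results)


def pass_pow_k(results: Sequence[bool]) -> bool:
    """pass^k: True only if every one of the k trial results is a success."""
    return len(results) > 0 and all(results)


def expected_pass_at_k(p: float, k: int) -> float:
    """Chance of at least one success in k trials.

    Assumption: trials are independent with the same success rate p.
    """
    return 1.0 - (1.0 - p) ** k


def expected_pass_pow_k(p: float, k: int) -> float:
    """Chance that all k trials succeed, under the same independence assumption."""
    return p ** k


if __name__ == "__main__":
    trials = [False, True, False]        # k = 3, one success
    print(pass_at_k(trials))             # True
    print(pass_pow_k(trials))            # False
    print(expected_pass_at_k(0.5, 3))    # 0.875
    print(expected_pass_pow_k(0.5, 3))   # 0.125
```

With a 50% per-attempt success rate and k = 3, pass@k is likely to be satisfied while pass^k rarely is, which is why pass^k is the one to watch when a task must work every time.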

Can I use this for regression testing?

Yes, Eval Harness is specifically designed to create regression suites that ensure new changes or model updates don't break existing, proven functionality.
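One way to picture a regression check is a comparison between a stored baseline and the latest run. The sketch below is hypothetical: the dictionary format and the `find_regressions` name are assumptions made for illustration, not the format Eval Harness actually stores.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Return the names of evals that passed in the baseline but do not pass now.

    Assumption: each mapping goes from an eval name to its pass/fail result.
    An eval missing from the current run is treated as a regression.
    """
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )


if __name__ == "__main__":
    baseline = {"login-flow": True, "csv-export": True, "search": False}
    current = {"login-flow": True, "csv-export": False, "search": False}
    print(find_regressions(baseline, current))  # ['csv-export']
```

Evals that were already failing in the baseline are not reported, since they are not proven functionality that a change could have broken.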

How does this skill handle non-deterministic AI outputs?

It uses multiple grader types, including code-based (deterministic checks), model-based (using Claude to evaluate open-ended output), and manual human review to provide a comprehensive view of performance.
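The three grader types can be sketched roughly as follows. Everything here is illustrative: the `Verdict` class, the function names, and the `judge` callable are assumptions, and the model-based grader takes a stand-in function rather than calling any real model API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    grader: str
    passed: Optional[bool]  # None means a human still has to review it
    note: str = ""


def code_grader(output: str, required: str) -> Verdict:
    """Deterministic check: the output must contain the required text."""
    return Verdict("code", required in output)


def model_grader(output: str, rubric: str, judge: Callable[[str, str], bool]) -> Verdict:
    """Open-ended check: delegate to a judge callable (hypothetical stand-in for a model call)."""
    return Verdict("model", judge(output, rubric), note=rubric)


def human_marker(reason: str) -> Verdict:
    """Flag the output for manual review instead of deciding automatically."""
    return Verdict("human", None, note=reason)


def summarize(verdicts: List[Verdict]) -> str:
    """Any failure wins; otherwise a pending review blocks an overall pass."""
    if any(v.passed is False for v in verdicts):
        return "fail"
    if any(v.passed is None for v in verdicts):
        return "needs-review"
    return "pass"


if __name__ == "__main__":
    output = "def add(a, b):\n    return a + b"
    stub_judge = lambda text, rubric: "return" in text  # placeholder for a real model judgment
    verdicts = [
        code_grader(output, "def add"),
        model_grader(output, "Function is readable", stub_judge),
        human_marker("Confirm naming matches team style"),
    ]
    print(summarize(verdicts))  # needs-review
```

Keeping the judge as a plain callable means the deterministic parts of the suite can be tested without a model in the loop.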

Where are the evaluation definitions stored?

Evaluations, logs, and baselines are stored directly within your project's .claude/evals/ directory, allowing them to be version-controlled alongside your code.

What is Evaluation-Driven Development (EDD)?

EDD is a methodology where you define the expected behavior and success criteria for an AI task before writing any code, treating evaluations as unit tests for AI outputs.

Eval Harness

Name: Eval Harness
Author: Fabio29T


Security and Testing

Implements a formal Evaluation-Driven Development (EDD) framework to benchmark, test, and measure the reliability of AI-assisted workflows.

Eval Harness provides a structured approach to AI development by treating evaluation as unit testing for AI. It allows developers to define success criteria before implementation, run automated regression tests, and track performance using industry-standard metrics like pass@k. By supporting deterministic code-based graders, qualitative model-based graders, and human review markers, it ensures that Claude Code sessions produce reliable, high-quality results while preventing regressions across model versions or prompt changes.

Key Features

01 Standardized reporting for capability benchmarks and task completion

02 Comprehensive regression testing suites for AI workflows

03 Support for deterministic code-based and qualitative model-based graders

04 Automated pass@k and pass^k reliability metrics tracking

05 Implementation of Evaluation-Driven Development (EDD) principles

06 GitHub stars: 0

Use Cases

01 Ensuring new code changes don't break existing functionality via regression evals

02 Defining clear success criteria for complex AI coding tasks before implementation

03 Benchmarking AI agent performance across different LLM versions or prompt iterations


Install with 🐟 Skill.Fish

npx skillfish add fabio29t/everything-claude eval-harness

For use in Claude.ai and ChatGPT
