Implements a formal evaluation framework for Claude Code sessions using Eval-Driven Development (EDD) principles to ensure AI reliability.
Eval Harness is a specialized framework designed to bring engineering rigor to AI-assisted development. By implementing Eval-Driven Development (EDD), it allows developers to define success criteria before implementation, track regressions with precision, and measure agent performance using advanced metrics like pass@k. It supports multiple grading methods—including deterministic code checks, LLM-based qualitative assessments, and manual human reviews—ensuring that Claude's output remains stable and high-quality across model versions and complex task iterations.
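To make the reliability metrics concrete, here is a minimal sketch of how pass@k and pass^k are typically estimated from repeated trials. The pass@k formula is the standard unbiased estimator from Chen et al. (2021); the pass^k function is the analogous all-samples-pass variant. Function names are illustrative, not necessarily this project's actual API.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples passes, given that c of n independent trials passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing trial
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n: int, c: int, k: int) -> float:
    """pass^k analogue: probability that all k sampled trials pass."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Example: a task run 10 times, 7 runs passed the graders.
print(f"pass@3 = {pass_at_k(10, 7, 3):.3f}")   # ~0.992: at least one of 3 attempts succeeds
print(f"pass^3 = {pass_pow_k(10, 7, 3):.3f}")  # ~0.292: all 3 attempts succeed
```

The gap between the two numbers is the point: pass@k rewards a single lucky success, while pass^k measures whether the agent succeeds consistently, which is the stricter bar for unattended use.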
Key Features
1. Eval-Driven Development (EDD) workflow integration
2. Support for pass@k and pass^k reliability metrics
3. Multiple grading modes, including code-based and model-based scoring (see the sketch after this list)
4. Automated regression testing and capability benchmarking
5. Standardized evaluation reporting and versioned artifact storage
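To illustrate what the grading modes above might look like in practice, here is a minimal sketch of a deterministic code-based grader next to an LLM-judge grader. All names here (GradeResult, CodeGrader, ModelGrader, the judge callable) are hypothetical, not this project's actual API.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Protocol

@dataclass
class GradeResult:
    passed: bool
    detail: str

class Grader(Protocol):
    def grade(self, output: str) -> GradeResult: ...

class CodeGrader:
    """Deterministic check: run a test command against the files the
    agent produced and grade on the exit code."""
    def __init__(self, test_command: list[str]):
        self.test_command = test_command

    def grade(self, output: str) -> GradeResult:
        proc = subprocess.run(self.test_command, capture_output=True, text=True)
        return GradeResult(passed=proc.returncode == 0, detail=proc.stdout[-500:])

class ModelGrader:
    """Qualitative check: ask a judge LLM whether the output meets a
    rubric. `judge` is any callable that sends a prompt, returns a reply."""
    def __init__(self, rubric: str, judge: Callable[[str], str]):
        self.rubric = rubric
        self.judge = judge

    def grade(self, output: str) -> GradeResult:
        reply = self.judge(
            f"Rubric:\n{self.rubric}\n\nOutput:\n{output}\n\n"
            "Answer PASS or FAIL on the first line, then explain."
        )
        return GradeResult(passed=reply.strip().upper().startswith("PASS"), detail=reply)
```

Code graders give cheap, deterministic signal (did the tests pass?), while model graders cover criteria that are hard to express as assertions, such as readability or adherence to style guidance.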
Use Cases
1. Benchmarking Claude's performance across different model versions or prompt changes
2. Measuring the success rate of non-deterministic AI coding tasks using pass@k metrics
3. Setting up rigorous 'unit tests' for AI tasks to prevent regressions in complex codebases (see the example below)
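In the EDD workflow, such an eval is written before the implementation, much as a failing unit test is written first in TDD. A hypothetical example of what an up-front definition could look like; the structure and field names are illustrative, not this project's schema:

```python
# Hypothetical eval case authored *before* asking Claude to implement the
# change; the task only "ships" once the thresholds below are met.
EVAL_CASE = {
    "id": "extract-session-store",
    "prompt": "Refactor auth.py: move session handling into a SessionStore class.",
    "graders": [
        {"type": "code", "command": ["pytest", "tests/test_auth.py", "-q"]},
        {"type": "model", "rubric": "No behavior change; public API preserved."},
    ],
    "trials": 10,                                  # repeat to estimate reliability
    "thresholds": {"pass@1": 0.8, "pass^3": 0.5},  # fixed success criteria
}
```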