How does the pass@k metric work?

The pass@k metric measures the probability that an AI agent succeeds at a task at least once within k attempts. Because agent output varies from run to run, this gives a more informative picture of reliability than the result of a single run.
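As a rough illustration of the arithmetic, assume each attempt succeeds independently with the same probability p. That independence assumption is made here for the sketch and is not stated by the skill. Under it, pass@k reduces to 1 - (1 - p)^k. The companion pass^k metric is read here as "all k attempts succeed", which gives p^k. The function names below are hypothetical and are not part of the Eval Harness.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts.

    Assumes every attempt succeeds with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p) ** k


def pass_all_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed (pass^k)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return p ** k


# An agent that succeeds 70% of the time on a single attempt:
print(round(pass_at_k(0.7, 3), 3))   # 0.973
print(round(pass_all_k(0.7, 3), 3))  # 0.343
```

The two numbers answer different questions: pass@k asks whether the agent can solve the task given a few tries, while pass^k asks whether it solves the task every time.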

Where are the evaluation results stored?

All evaluation definitions, history logs, and baselines are stored locally within your project's .claude/evals/ directory.

When should I use a Model-Based Grader vs a Code-Based Grader?

Use Code-Based Graders for deterministic outcomes like build success or syntax. Use Model-Based Graders (Claude) for open-ended qualities like code readability, structure, and logic.

Can I use automated scripts for grading?

Yes, the Eval Harness supports Code-Based Graders that use grep, bash scripts, and test runners like npm test to provide deterministic pass/fail results.
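A code-based grader of this kind can be sketched as a function that runs a command and treats exit status 0 as a pass. This is a minimal illustration and not the Eval Harness implementation: the name run_code_grader and its parameters are hypothetical, and the commands in the example are stand-ins for something like a test runner.

```python
import subprocess
import sys
from typing import Sequence


def run_code_grader(command: Sequence[str], timeout_s: float = 60.0) -> bool:
    """Return True if the command exits with status 0, False otherwise.

    A missing executable or a timeout is also treated as a failure.
    """
    try:
        result = subprocess.run(
            list(command),
            capture_output=True,  # keep grader output off the console
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


# Stand-in commands; in a real project this might be ["npm", "test"].
print(run_code_grader([sys.executable, "-c", "raise SystemExit(0)"]))  # True
print(run_code_grader([sys.executable, "-c", "raise SystemExit(1)"]))  # False
```

Relying on the exit code keeps the grader deterministic: the same command on the same code produces the same pass/fail result.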

What is Eval-Driven Development (EDD)?

EDD is a methodology where developers define expected AI behaviors as 'evals' before implementation, functioning similarly to unit tests but specifically for AI agent outputs.
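The "define first, implement second" order can be sketched as follows. This is an illustrative assumption about how such evals might look, not the skill's actual format: the Eval class, the check functions, and the sample agent output are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Eval:
    """A named expectation about an agent's output (hypothetical structure)."""
    name: str
    check: Callable[[str], bool]


# Step 1: write the expectations before any implementation exists.
EVALS: List[Eval] = [
    Eval("defines_slugify", lambda out: "def slugify(" in out),
    Eval("has_docstring", lambda out: '"""' in out),
]


def run_evals(output: str, evals: List[Eval]) -> Dict[str, bool]:
    """Apply every eval to the output and report pass/fail per eval."""
    return {e.name: e.check(output) for e in evals}


# Step 2: once the agent produces code, grade it against the same evals.
candidate = (
    "def slugify(text):\n"
    '    """Lowercase the text and replace spaces with hyphens."""\n'
    '    return text.lower().replace(" ", "-")\n'
)
print(run_evals(candidate, EVALS))
# {'defines_slugify': True, 'has_docstring': True}
```

Before the implementation exists, every eval fails, which mirrors the failing-test-first step of test-driven development.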

Eval Harness (EDD Framework)

Author: xu-xiang

GitHub stars: 323

Category: Security & Testing

Implements Eval-Driven Development (EDD) principles to systematically evaluate AI-generated code and agent performance within Claude Code.

The Eval Harness skill provides a formal framework for Eval-Driven Development (EDD), allowing developers to treat evaluations as unit tests for AI-assisted workflows. It enables Claude to define expected behaviors before implementation, run capability and regression tests, and measure reliability using pass@k metrics. By supporting code-based, model-based, and human-in-the-loop graders, this skill ensures that AI agents remain consistent, reliable, and performant across model updates and code changes.

Key Features

01 Standardized templates for Capability and Regression evaluations

02 Reliability tracking using pass@k and pass^k metrics

03 323 GitHub stars

04 Persistent local storage for eval definitions, logs, and baselines

05 Multi-modal grading via code scripts, model-based evaluation, and human review

06 Automated evaluation workflows (Define, Implement, Eval, Report)

Use Cases

01 Creating regression test suites to prevent AI-driven functional regressions

02 Setting up Eval-Driven Development (EDD) for complex AI-assisted coding tasks

03 Benchmarking AI agent performance across different model versions or prompts


Install with 🐟 Skill.Fish

npx skillfish add xu-xiang/everything-claude-code-zh eval-harness

For use in Claude.ai and ChatGPT
