Does it support manual verification?

Yes, it includes a specific 'Human Grader' type designed to flag complex changes or UI/UX updates that require a manual review before being marked as successful.

Where are the evaluation results stored?

All eval definitions, run histories, and baselines are stored locally within your project in the .claude/evals/ directory, allowing them to be versioned alongside your source code.

Can I automate the evaluation process?

Yes, the skill supports code-based graders using shell commands or test runners (like npm test) to provide deterministic, automated pass/fail results for your AI's output.
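A code-based grader of this kind can be sketched in a few lines of Python. This is only an illustration, not the skill's own implementation: the helper name `run_code_grader` and its timeout default are assumptions, and the only convention relied on is that a command exiting with status 0 counts as a pass.

```python
import subprocess
import sys


def run_code_grader(command: list[str], timeout_s: float = 120.0) -> bool:
    """Hypothetical helper: run a command and treat exit status 0 as a pass."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # A hung or missing command is graded as a failure, not an error.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in commands; a real project might pass ["npm", "test"] instead.
    passing = run_code_grader([sys.executable, "-c", "raise SystemExit(0)"])
    failing = run_code_grader([sys.executable, "-c", "raise SystemExit(1)"])
    print(passing, failing)
```

Because the result depends only on the exit status, the same grader gives the same verdict every time it sees the same output, which is what makes it deterministic.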

How does this skill measure AI reliability?

It uses pass@k metrics: pass@k is the probability that at least one of k attempts succeeds (for example, pass@3 is the chance of at least one success in three tries). Tracking this value across runs helps you gauge how consistently the model produces a passing result.
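One common way to estimate pass@k from recorded runs is the combinatorial estimator below. It is a sketch under stated assumptions, not necessarily the formula the skill uses: `n` is the number of recorded attempts, `c` the number that passed, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts passes) from n attempts with c passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that k attempts drawn without replacement are all failures.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


if __name__ == "__main__":
    # 3 passes out of 10 recorded attempts, evaluated at k = 3.
    print(round(pass_at_k(10, 3, 3), 3))  # 0.708
```

With k = 1 the estimate reduces to the plain success rate c / n, so pass@1 is the figure to watch when every single attempt has to work.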

What is Eval-Driven Development (EDD)?

EDD is a methodology where you define expected AI behaviors as evaluation criteria (evals) before implementation, similar to how Test-Driven Development (TDD) works in traditional software engineering.

Claude Eval Harness

Author: ShunmeiCho


Security and Testing

Implements a formal evaluation framework for Claude Code sessions using Eval-Driven Development (EDD) principles.

The Eval Harness Skill lets developers hold AI-generated code to the same rigor as traditional software through Eval-Driven Development (EDD). You define success criteria and expected behaviors before implementation; the skill then provides a structured framework for running capability and regression evals, tracking reliability metrics such as pass@k, and generating detailed evaluation reports. It is particularly useful for complex feature implementations where consistent AI performance and regression prevention are critical, since it supports a systematic 'test-first' approach to AI prompting and coding.

Key Features

01 Multiple grader types: code-based, model-based, and human review

02 Structured project storage for eval definitions and run history

03 Integrated CLI commands for defining, checking, and reporting evals

04 Automated pass@k and pass^k reliability metrics to measure AI consistency

05 Standardized Capability and Regression Eval templates for structured testing
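The listing does not define pass^k. Assuming it follows the common convention of "all k attempts succeed" (the stricter counterpart to pass@k), it can be estimated from the same run history. The function name `pass_pow_k` and the parameters `n` (recorded attempts) and `c` (passing attempts) are illustrative assumptions, not part of the skill.

```python
from math import comb


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts pass) from n attempts with c passes (assumed definition)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fraction of size-k subsets of the n attempts that contain only passes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # 8 passes out of 10 recorded attempts, evaluated at k = 3.
    print(round(pass_pow_k(10, 8, 3), 3))  # 0.467
```

Under this reading, pass@k answers "can the model get it right at least once?" while pass^k answers "does it get it right every time?", which makes pass^k the more natural measure for regression evals.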

Use Cases

01 Defining feature requirements as executable tests before starting AI implementation

02 Measuring AI model reliability and success rates across multiple attempts (pass@k)

03 Continuous regression testing to ensure new AI changes don't break existing logic


Install with 🐟 Skill.Fish

npx skillfish add shunmeicho/claude-config eval-harness
