What types of graders are supported?

The framework supports deterministic code-based graders (e.g., grep, npm test), model-based graders using Claude, and manual human review markers.
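As a rough illustration of the deterministic, code-based category, the sketch below treats a grader as a shell command whose exit status decides pass or fail. The helper name `run_code_grader` and its timeout parameter are assumptions made for this example; they are not taken from the skill itself.

```python
import subprocess
import sys


def run_code_grader(command: list[str], timeout_s: float = 60.0) -> bool:
    """Hypothetical helper: pass if the command exits with status 0."""
    try:
        result = subprocess.run(
            command,
            capture_output=True,  # keep grader output out of the console
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or missing command counts as a failed check.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-ins for real checks such as ["npm", "test"] or a grep invocation.
    passing = [sys.executable, "-c", "raise SystemExit(0)"]
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    print(run_code_grader(passing), run_code_grader(failing))
```

Relying on the exit status keeps the check deterministic: the same code and the same command always produce the same verdict, which is what separates this category from model-based grading.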

How does this skill measure reliability?

It uses pass@k metrics (at least one success in k attempts) and pass^k metrics (success in all k attempts) to quantify the consistency of AI outputs.
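The two definitions can be made concrete with a short sketch. It assumes each attempt has already been graded to a boolean; the function names `pass_at_k` and `pass_pow_k` are illustrative and are not the skill's own API.

```python
def _first_k(outcomes: list[bool], k: int) -> list[bool]:
    """Validate k and return the first k graded attempts."""
    if not 1 <= k <= len(outcomes):
        raise ValueError("k must be between 1 and the number of attempts")
    return outcomes[:k]


def pass_at_k(outcomes: list[bool], k: int) -> bool:
    """pass@k: at least one of the first k attempts succeeded."""
    return any(_first_k(outcomes, k))


def pass_pow_k(outcomes: list[bool], k: int) -> bool:
    """pass^k: every one of the first k attempts succeeded."""
    return all(_first_k(outcomes, k))


if __name__ == "__main__":
    attempts = [False, True, True]  # graded results of three attempts
    print(pass_at_k(attempts, 3), pass_pow_k(attempts, 3))
```

For the same attempts, pass@k can hold while pass^k fails, so pass^k is the stricter signal when consistency matters more than eventual success.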

Where are the evaluation files stored?

Evaluations are stored within your project in the .claude/evals/ directory, allowing them to be version-controlled alongside your source code.

What is Eval-Driven Development (EDD)?

EDD treats evaluations as the AI equivalent of unit tests: expected behaviors are defined before implementation, which helps maintain code quality and prevent regressions.

Can I use this for regression testing?

Yes, it specifically includes a workflow for regression evals to ensure new code changes do not break existing functionality or benchmarks.
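One way to picture a regression check is a comparison between a stored baseline run and the current run. The sketch below assumes results are kept as a mapping from eval name to pass/fail; that layout and the name `find_regressions` are hypothetical and do not describe the skill's actual report format.

```python
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return evals that passed in the baseline but fail or are missing now."""
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )


if __name__ == "__main__":
    # Hypothetical eval names and results, for illustration only.
    baseline = {"login": True, "export": True, "search": False}
    current = {"login": True, "export": False, "search": True}
    print(find_regressions(baseline, current))
```

Only evals that previously passed are considered, so a check that was already failing in the baseline is not reported as a new regression.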

AI Eval Harness

Name: AI Eval Harness
Author: saar120

Security & Testing

Implements a formal evaluation framework for Claude Code using Eval-Driven Development (EDD) to ensure AI-generated code quality.

The Eval Harness skill introduces Eval-Driven Development (EDD) to the Claude Code environment, treating evaluations as the AI-native equivalent of unit tests. It lets developers define expected behaviors before implementation, track regressions through automated benchmarks, and measure reliability with metrics such as pass@k. Its structured workflows for capability and regression testing, scored by code-based, model-based, and human graders, are intended to keep AI-generated code correct and stable throughout the development lifecycle.

Key Features

01 Implements Eval-Driven Development (EDD) principles for AI-assisted coding

02 Generates comprehensive evaluation reports with version-controlled history

03 Includes multi-modal scoring via code-based, model-based, and human graders

04 0 GitHub stars

05 Supports automated Capability and Regression evaluation types

06 Tracks reliability metrics including pass@k and pass^k indicators

Use Cases

01 Measuring the reliability and consistency of Claude’s outputs across multiple attempts

02 Defining success criteria for complex features before generating implementation code

03 Preventing regression errors in existing logic during iterative AI refactoring sessions

Install with 🐟 Skill.Fish

npx skillfish add saar120/skillgate-registry eval-harness

For use in Claude.ai and ChatGPT
