Where are the evaluation results and logs stored?

All eval definitions, execution logs, and regression baselines are stored within your project's .claude/evals/ directory for easy versioning and review.

Can I integrate existing test suites like Jest or Mocha?

Yes. The framework includes code-based graders that can execute bash commands and existing test runners to check whether code changes pass your current test suite.
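
As an illustration of what a code-based grader might look like, here is a minimal Python sketch that runs an arbitrary command and maps its exit code to PASS or FAIL. The function name `run_code_grader`, its parameters, and the returned dictionary shape are assumptions made for this example; they are not taken from the skill itself.

```python
import subprocess


def run_code_grader(command: list[str], timeout_s: float = 120.0) -> dict:
    """Run a test command and grade it by exit code (hypothetical helper).

    Exit code 0 is treated as PASS; anything else, including a timeout
    or a missing executable, is treated as FAIL.
    """
    try:
        proc = subprocess.run(
            command,
            capture_output=True,  # collect stdout and stderr for the log
            text=True,            # decode output as str instead of bytes
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"status": "FAIL", "exit_code": None, "output": "timed out"}
    except FileNotFoundError:
        return {"status": "FAIL", "exit_code": None, "output": "command not found"}

    return {
        "status": "PASS" if proc.returncode == 0 else "FAIL",
        "exit_code": proc.returncode,
        "output": proc.stdout + proc.stderr,
    }
```

With Jest or Mocha installed in the project, a call such as `run_code_grader(["npx", "jest"])` or `run_code_grader(["npx", "mocha"])` would grade the existing suite, since both runners exit with a non-zero code when tests fail.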

How does the pass@k metric work in this skill?

Pass@k is the probability that the AI succeeds at least once within k attempts. It helps developers gauge how reliable a specific prompt or task is.
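
To make the metric concrete, the sketch below estimates pass@k from n recorded trials of which c succeeded, using the commonly used combinatorial estimator 1 - C(n-c, k) / C(n, k). It also includes pass^k, which the feature list mentions and which is assumed here to mean the probability that all k attempts succeed. The function names are hypothetical, and the skill may compute these values differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one success in k attempts) from n trials, c successes."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k trials include a success.
        return 1.0
    # 1 minus the chance that all k sampled trials are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts succeed) from n trials, c successes (assumed definition)."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Chance that all k sampled trials are successes; comb(c, k) is 0 when k > c.
    return comb(c, k) / comb(n, k)
```

For a task that succeeded in 3 of 10 trials, `pass_at_k(10, 3, 1)` is about 0.3, and it rises as k grows, while `pass_pow_k` falls as k grows. Reading the two together shows whether a task is merely solvable or consistently solvable.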

What is Eval-Driven Development (EDD)?

EDD is a methodology where AI behavior is defined and tested through formal evaluations before and during the development process, acting as unit tests for AI-generated code.

Eval Harness for Claude Code

Author: oabdelmaksoud

Security & Testing

Implements Eval-Driven Development (EDD) principles to systematically evaluate and verify AI-generated code quality and reliability.

The Eval Harness skill provides a robust framework for implementing Eval-Driven Development (EDD) within Claude Code sessions. By treating evaluations as unit tests for AI, it allows developers to define expected behaviors before implementation, track regressions across sessions, and measure reliability using pass@k metrics. It supports diverse grading methods, including deterministic code-based checks, model-based qualitative assessments, and manual human reviews, ensuring that AI-assisted development remains predictable, verifiable, and production-ready.

Key Features

01 Automated scoring via code-based, model-based, and human graders

02 Structured eval reporting with PASS/FAIL status and success ratios

03 Reliability measurement using pass@k and pass^k metrics

04 Capability and regression eval definitions for pre-implementation planning

05 Version-controlled eval storage within the .claude/evals directory

Use Cases

01 Defining success criteria before AI implementation to ensure target outcomes

02 Tracking regressions in critical code paths during automated refactoring

03 Measuring the consistency and reliability of AI-generated solutions across multiple trials

Install with 🐟 Skill.Fish

npx skillfish add oabdelmaksoud/superclaw eval-harness

For use in Claude.ai and ChatGPT
