How does pass@k measure AI reliability?

Pass@k is the probability that an AI agent succeeds at least once in k attempts. Because AI outputs are non-deterministic, it describes reliability better than a single pass/fail check.
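As a rough illustration, the sketch below estimates pass@k from n recorded attempts of which c succeeded. It uses the combinatorial estimator common in code-generation evaluations. This listing does not say how the skill itself computes the metric, so the function names and formula here are assumptions. The second function reads pass^k, the other metric named in the feature list, as the probability that all k attempts succeed, which is also an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts succeeds.

    n: attempts recorded, c: attempts that passed, k: attempts considered.
    Hypothetical helper, not part of the skill's documented interface.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k
    # subset of the n attempts contains no passing attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that all k attempts succeed (assumed meaning of pass^k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Chance that a random size-k subset consists only of passing attempts.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Example: 10 recorded attempts, 3 of which passed.
    print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")
    print(f"pass^2 = {pass_pow_k(10, 3, 2):.3f}")
```

With the same inputs, pass@k rises as k grows while pass^k falls, so the first answers "can the agent do this at all?" and the second "does it do this every time?".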

What is Eval-Driven Development (EDD)?

EDD is a software development methodology where evaluation criteria and success metrics are defined before implementation to guide AI agents and provide objective measurement of their success.

Where are the evaluation results stored?

All evaluation definitions, run histories, and baseline benchmarks are stored locally within your project's .claude/evals/ directory for easy versioning and tracking.

Does this support manual human review?

Yes, it includes a Human Grader type designed to flag specific changes for manual review, which is ideal for security-sensitive code or high-stakes UI changes.

Can I use this for regression testing?

Yes, the skill includes specific regression evaluation templates to ensure that new code changes or prompt updates do not break existing project functionality.
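The idea behind a regression evaluation can be sketched in a few lines: compare the latest run against a saved baseline and report anything that used to pass and no longer does. The data shape (a mapping from eval name to pass/fail) and the function name are hypothetical. The skill's actual templates and file formats are not shown in this listing.

```python
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return names of evals that passed in the baseline but do not pass now.

    An eval missing from the current run is treated as a regression,
    which is a design choice for this sketch, not documented behavior.
    """
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )


if __name__ == "__main__":
    # Hypothetical eval names, for illustration only.
    baseline = {"login-flow": True, "export-csv": True, "dark-mode": False}
    current = {"login-flow": True, "export-csv": False, "dark-mode": True}
    print(find_regressions(baseline, current))
```

Evals that failed in the baseline are ignored here, since they cannot regress. Tracking those belongs to capability evaluation rather than regression testing.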

EDD Eval Harness

Name: EDD Eval Harness
Author: oabdelmaksoud


Security & Testing

Implements a formal evaluation framework for Claude Code sessions using eval-driven development principles to measure and ensure AI agent reliability.

The EDD Eval Harness is a framework for managing AI agent performance through Eval-Driven Development (EDD). Developers define success criteria before implementation, run capability and regression tests continuously, and quantify reliability with metrics such as pass@k. By combining code-based, model-based, and human-led grading, the skill helps keep AI-assisted workflows stable, performant, and free of regressions across model versions and prompt changes.

Key Features

01 Multiple grader types, including deterministic code checks and AI-based reviews

02 Integrated CLI workflow for defining, checking, and reporting evals

03 Standardized storage for eval definitions, logs, and baselines within projects

04 Reliability tracking using pass@k and pass^k performance metrics

05 0 GitHub stars

06 Automated capability and regression evaluation suites for AI tasks

Use Cases

01 Preventing regressions during complex refactors by running standardized test suites

02 Benchmarking Claude's performance across different model versions or system prompts

03 Implementing test-first AI development, where success criteria are defined before code is generated


Install with 🐟 Skill.Fish

npx skillfish add oabdelmaksoud/superclaw eval-harness

For use in Claude.ai and ChatGPT
