What is the difference between pass@k and pass^k?

pass@k is the probability that at least one of k attempts succeeds, which suits creative or exploratory tasks where a single good result is enough. pass^k is the probability that all k attempts succeed, which sets a higher bar and suits critical code paths where consistency matters.
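As a rough illustration of the two metrics, the sketch below computes both from a per-attempt success rate `p`. It assumes attempts are independent and share the same success probability; the function names `pass_at_k` and `pass_pow_k` are hypothetical and are not taken from the skill itself.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("p must be in [0, 1] and k must be >= 1")
    # Complement of "every attempt fails".
    return 1.0 - (1.0 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed (pass^k)."""
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("p must be in [0, 1] and k must be >= 1")
    return p ** k
```

With p = 0.8 and k = 3, pass@k is about 0.992 while pass^k is about 0.512, which shows why pass^k is the stricter measure.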

Does this require manual grading?

No. The skill supports three types of graders: automated code-based graders (deterministic), model-based graders (AI-driven), and human-in-the-loop review for high-risk changes.
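One way to picture the three grader types is as functions that return a common result shape. This is a minimal sketch, not the skill's actual interface: `GradeResult`, `code_grader`, `model_grader`, `human_grader`, and the `judge` callable with its 0.7 threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GradeResult:
    passed: bool
    grader: str      # "code", "model", or "human"
    note: str = ""


def code_grader(check: Callable[[str], bool]) -> Callable[[str], GradeResult]:
    """Deterministic grader: wraps a plain boolean check on the output."""
    def grade(output: str) -> GradeResult:
        return GradeResult(passed=bool(check(output)), grader="code")
    return grade


def model_grader(judge: Callable[[str], float],
                 threshold: float = 0.7) -> Callable[[str], GradeResult]:
    """Model-based grader: `judge` stands in for an AI call that returns a 0..1 score."""
    def grade(output: str) -> GradeResult:
        score = judge(output)
        return GradeResult(passed=score >= threshold, grader="model",
                           note=f"score={score:.2f}")
    return grade


def human_grader(output: str) -> GradeResult:
    """Human-in-the-loop: never auto-passes; marks the output as awaiting review."""
    return GradeResult(passed=False, grader="human", note="pending review")
```

Keeping a shared result type means a report can mix all three grader kinds without special cases.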

How does this skill handle regression testing?

It allows you to set baselines for existing functionality. Whenever changes are made, the harness runs regression evals to ensure the AI hasn't introduced new bugs or broken previous features.
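The baseline comparison can be sketched as a diff between two result sets. This assumes results are stored as a mapping from eval name to pass/fail, which is an assumption for illustration; the skill's actual storage format is not described here, and `find_regressions` is a hypothetical helper.

```python
def find_regressions(baseline: dict[str, bool],
                     current: dict[str, bool]) -> list[str]:
    """Return evals that passed in the baseline but do not pass now.

    An eval missing from `current` is treated as a regression, since
    previously working behavior can no longer be confirmed.
    """
    return sorted(
        name
        for name, passed_before in baseline.items()
        if passed_before and not current.get(name, False)
    )


baseline = {"login": True, "export": True, "search": False}
current = {"login": True, "export": False, "search": False}
regressions = find_regressions(baseline, current)  # ["export"]
```

Evals that were already failing in the baseline, such as "search" above, are not reported, so the check only flags behavior that the latest change broke.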

What is Eval-Driven Development (EDD)?

EDD is a methodology where you define expected AI behavior and success criteria before implementation, using these 'evals' as benchmarks to guide the AI's development process.

Eval Harness for Claude Code

Author: Aventerica89

Security & Testing

Implements a formal eval-driven development framework to measure, test, and improve AI coding reliability through structured evaluations.

The Eval Harness skill integrates Eval-Driven Development (EDD) principles directly into your Claude Code workflow, treating evaluations as the essential unit tests for AI-assisted development. By defining success criteria before implementation, users can track feature capabilities and regression risks through deterministic code-based graders, model-based qualitative reviews, and pass@k metrics. This structured approach ensures that AI-generated code is not only functional but meets specific, repeatable quality standards across multiple sessions.

Key Features

01 Standardized evaluation reporting and status checking

02 Capability and regression evaluation templates

03 Multi-mode grading via code, AI models, or human review

04 0 GitHub stars

05 Integrated file-based storage for evaluation history

06 Reliability tracking using pass@k and pass^k metrics

Use Cases

01 Measuring the consistency and reliability of Claude's output across multiple attempts

02 Defining and validating success criteria for complex features before writing code

03 Running automated regression checks to ensure AI refactors don't break existing logic


Install with 🐟 Skill.Fish

npx skillfish add aventerica89/bricks-builder-agent eval-harness

For use in Claude.ai and ChatGPT
