Implements an Eval-Driven Development (EDD) framework to measure and improve AI coding reliability through structured testing and metrics.
The Eval Harness skill brings rigorous Eval-Driven Development (EDD) principles to your coding sessions, treating evaluations as the primary unit tests for AI-generated code. Developers define success criteria before implementation, run automated code-based or model-based graders, and track reliability with metrics such as pass@k. By organizing capability and regression evals within the .opencode/ directory, the skill helps ensure that new features meet expectations without breaking existing functionality, providing a professional-grade workflow for building robust AI-driven applications.
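To make the workflow concrete, here is a minimal sketch of a code-based eval, assuming a convention of one grader function per criterion stored under .opencode/. All names (`Eval`, `grade_stable_sort`, `stable_sort`) are illustrative, not the skill's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Eval:
    name: str
    kind: str                    # "capability" or "regression"
    grader: Callable[[], bool]   # returns True when the success criteria are met

def stable_sort(xs):
    # Stand-in for the AI-generated code under test.
    return sorted(xs)

def grade_stable_sort() -> bool:
    # Success criteria written BEFORE implementation:
    # must handle duplicates and empty input.
    return stable_sort([3, 1, 2, 1]) == [1, 1, 2, 3] and stable_sort([]) == []

# A tiny registry of evals; a real harness would discover these from files.
evals = [Eval("stable-sort", "capability", grade_stable_sort)]
results = {e.name: e.grader() for e in evals}
```

Because the grader is plain code, it can run automatically on every iteration, and its pass/fail history can be logged alongside the eval definition.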
Key Features
1. Structured Eval-Driven Development (EDD) Workflow
2. Automated Eval Reporting and History Logging
3. Multi-modal Grading (Code-based, Model-based, and Human)
4. Capability and Regression Eval Framework
5. Reliability Metrics Tracking, including pass@k and pass^k
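The two reliability metrics named above can be sketched as follows. This uses the standard unbiased pass@k estimator from the code-generation literature; the function names are illustrative, not the skill's API:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c passed,
    is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures, so any k samples include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(p: float, k: int) -> float:
    """pass^k: probability that ALL k independent attempts succeed,
    given a per-attempt pass rate p. A much stricter reliability bar."""
    return p ** k
```

For example, with 1 pass in 2 attempts, `pass_at_k(2, 1, 1)` is 0.5, while a 50% per-attempt rate gives `pass_pow_k(0.5, 2)` of only 0.25, which is why pass^k is the better metric when every run must succeed.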
Use Cases
1. Validating new feature implementations against predefined success criteria.
2. Measuring the reliability of AI-generated solutions across multiple iterations.
3. Preventing regressions in complex codebases during AI-assisted refactoring.