Implements an Eval-Driven Development (EDD) framework to measure and improve AI coding reliability through structured testing and metrics.
The Eval Harness skill brings rigorous Eval-Driven Development (EDD) principles to your coding sessions, treating evaluations as the primary unit tests for AI-generated code. Developers define success criteria before implementation, run automated code-based or model-based graders, and track reliability with metrics such as pass@k. By organizing capability and regression evals within the .opencode/ directory, the skill helps ensure that new features meet expectations without breaking existing functionality, providing a professional-grade workflow for building robust AI-driven applications.
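To make the workflow concrete, here is a minimal sketch of a code-based eval, assuming a convention of one grader function per criterion stored under .opencode/. All names (`Eval`, `grade_stable_sort`, `stable_sort`) are illustrative, not the skill's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Eval:
    name: str
    kind: str                    # "capability" or "regression"
    grader: Callable[[], bool]   # returns True when the success criteria are met

def stable_sort(xs):
    # Stand-in for the AI-generated code under test.
    return sorted(xs)

def grade_stable_sort() -> bool:
    # Success criteria written BEFORE implementation:
    # must handle duplicates and empty input.
    return stable_sort([3, 1, 2, 1]) == [1, 1, 2, 3] and stable_sort([]) == []

# A tiny registry of evals; a real harness would discover these from files.
evals = [Eval("stable-sort", "capability", grade_stable_sort)]
results = {e.name: e.grader() for e in evals}
```

Because the grader is plain code, it can run automatically on every iteration, and its pass/fail history can be logged alongside the eval definition.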
Key Features
1. Structured Eval-Driven Development (EDD) Workflow
2. Automated Eval Reporting and History Logging
3. Multi-modal Grading (Code-based, Model-based, and Human)
4. Capability and Regression Eval Framework
5. Reliability Metrics Tracking, including pass@k and pass^k
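The two reliability metrics named above can be sketched as follows. This uses the standard unbiased pass@k estimator from the code-generation literature; the function names are illustrative, not the skill's API:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c passed,
    is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures, so any k samples include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(p: float, k: int) -> float:
    """pass^k: probability that ALL k independent attempts succeed,
    given a per-attempt pass rate p. A much stricter reliability bar."""
    return p ** k
```

For example, with 1 pass in 2 attempts, `pass_at_k(2, 1, 1)` is 0.5, while a 50% per-attempt rate gives `pass_pow_k(0.5, 2)` of only 0.25, which is why pass^k is the better metric when every run must succeed.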
Use Cases
1. Validating new feature implementations against predefined success criteria.
2. Measuring the reliability of AI-generated solutions across multiple iterations.
3. Preventing regressions in complex codebases during AI-assisted refactoring.