Regression evaluation suite to prevent feature breakage during refactoring
Support for deterministic code-based graders and probabilistic model-based graders
Reliability metrics tracking including pass@k and pass^k success rates
Eval-Driven Development (EDD) workflow for defining success criteria before implementation
Structured evaluation reporting and local storage in the .claude/ directory
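
The pass@k and pass^k metrics named in the list above are not defined here. A minimal sketch, assuming the commonly used definitions: pass@k is the chance that at least one of k attempts succeeds, and pass^k is the chance that all k attempts succeed, both estimated from n recorded trials of which c passed. The function names `pass_at_k` and `pass_hat_k` and the sample trial data are hypothetical and are not taken from the tool itself.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Validate inputs shared by both estimators.
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials passes) from n trials with c passes."""
    _check(n, c, k)
    # Fraction of k-subsets of the n trials that contain no passing trial,
    # subtracted from 1. comb() returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials pass) from n trials with c passes."""
    _check(n, c, k)
    # Fraction of k-subsets of the n trials made up only of passing trials.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical outcomes of running one eval 10 times (True = passed).
    trials = [True, True, False, True, True, True, False, True, True, True]
    n, c = len(trials), sum(trials)
    for k in (1, 3, 5):
        print(f"k={k}: pass@k={pass_at_k(n, c, k):.3f}  pass^k={pass_hat_k(n, c, k):.3f}")
```

Under these definitions, pass@k rises with k and suits tasks where one good attempt is enough, while pass^k falls with k and suits regression checks where every run is expected to succeed.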