01 Standardized eval reporting and file-based storage in .claude/evals/
02 Automated pass@k and pass^k reliability metric tracking
03 0 GitHub stars
04 Support for deterministic code-based and probabilistic model-based graders
05 Formal Eval-Driven Development (EDD) workflow integration
06 Regression testing suites to prevent agent performance decay
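
The list above names pass@k and pass^k without defining them. As a minimal sketch, assuming the commonly used definitions (pass@k: at least one of k sampled trials succeeds; pass^k: all k sampled trials succeed), both can be estimated from n recorded trials of which c passed. The function names `pass_at_k` and `pass_hat_k` are hypothetical and not taken from the tool itself.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = trials run, c = trials that passed, k = sample size being scored
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k trials passes.

    Uses the combinatorial estimator 1 - C(n - c, k) / C(n, k):
    one minus the fraction of k-subsets made up only of failures.
    """
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that all k trials pass (pass^k).

    Uses C(c, k) / C(n, k): the fraction of k-subsets made up
    only of successes.
    """
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical run: 10 trials of one eval task, 5 of them passed.
    for k in (1, 2, 5):
        print(k, round(pass_at_k(10, 5, k), 4), round(pass_hat_k(10, 5, k), 4))
```

With these definitions the two metrics coincide at k = 1 and diverge as k grows: pass@k rises toward 1, while pass^k falls, which is why pass^k is the stricter signal for reliability.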