01 Reliability tracking using pass@k and pass^k metrics
02 1 GitHub star
03 Automated evaluation report generation for project stakeholders
04 Standardized templates for Capability and Regression evaluations
05 Multi-modal grading including Code-based, Model-based, and Human review
06 Version-controlled eval storage within the .claude directory
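
The pass@k and pass^k metrics in the first item can be illustrated with a short sketch. It assumes the common definitions: pass@k is the probability that at least one of k sampled attempts passes, and pass^k is the probability that all k sampled attempts pass, each estimated from n recorded attempts of which c passed. The function names `pass_at_k` and `pass_hat_k` are hypothetical and are not taken from this project; its own implementation may differ.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = attempts recorded, c = attempts that passed, k = sample size.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled attempts passes).

    Computed as 1 - C(n - c, k) / C(n, k): one minus the chance that
    all k attempts drawn without replacement are failures.
    """
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled attempts pass), written pass^k.

    Computed as C(c, k) / C(n, k): the chance that all k attempts
    drawn without replacement are passes.
    """
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 10 recorded attempts, 7 of which passed.
    n, c = 10, 7
    for k in (1, 3, 5):
        print(f"k={k}: pass@k={pass_at_k(n, c, k):.3f}  pass^k={pass_hat_k(n, c, k):.3f}")
```

With k = 1 both estimates reduce to the plain pass rate c / n. As k grows, pass@k rises toward 1 while pass^k falls toward 0, which is why pass^k is the stricter choice when the goal is consistent behavior rather than occasional success.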