Automated evaluation reporting with status summaries
Reliability tracking using pass@k and pass^k metrics
Multi-modal grading including code-based, model-based, and human-led reviews
Local storage of eval definitions and history within the .claude directory
Standardized templates for Capability and Regression evaluations
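
The feature list names pass@k and pass^k without defining them. As a rough sketch, assuming the commonly used combinatorial estimators (pass@k as the chance that at least one of k sampled trials passes, pass^k as the chance that all k sampled trials pass), they could be computed from n recorded trials with c passes as below. The function names `pass_at_k` and `pass_pow_k` are hypothetical and are not taken from the tool itself; the tool's own calculation may differ.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = trials run, c = trials that passed, k = sample size being scored.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k sampled trials passes.

    Assumed estimator: 1 - C(n - c, k) / C(n, k).
    """
    _check(n, c, k)
    # comb(n - c, k) is 0 when fewer than k trials failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimated probability that all k sampled trials pass (pass^k).

    Assumed estimator: C(c, k) / C(n, k).
    """
    _check(n, c, k)
    # comb(c, k) is 0 when fewer than k trials passed, giving 0.0.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers only: 10 trials, 5 of which passed.
    for k in (1, 2, 5):
        print(k, pass_at_k(10, 5, k), pass_pow_k(10, 5, k))
```

Under these definitions pass@k rises with k while pass^k falls, so tracking both separates "can it ever succeed" from "does it succeed every time".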