01 Standardized templates for Capability and Regression evaluations
02 Reliability tracking using pass@k and pass^k metrics
03 323 GitHub stars
04 Persistent local storage for eval definitions, logs, and baselines
05 Multi-modal grading via code scripts, model-based evaluation, and human review
06 Automated evaluation workflows (Define, Implement, Eval, Report)
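
The list names pass@k and pass^k without defining them. A minimal sketch follows, assuming the commonly used definitions: pass@k is the probability that at least one of k sampled trials succeeds, and pass^k is the probability that all k sampled trials succeed. Both are estimated here from n recorded trials, c of which passed, by sampling k trials without replacement. The function names `pass_at_k` and `pass_hat_k` are illustrative and not taken from the project, which may compute these metrics differently.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    """Validate trial counts: 0 <= c <= n and 1 <= k <= n."""
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials passes) from n trials with c passes.

    Hypothetical helper: 1 - C(n - c, k) / C(n, k), i.e. one minus the chance
    that all k trials drawn without replacement are failures.
    """
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials pass) from n trials with c passes.

    Hypothetical helper: C(c, k) / C(n, k), i.e. the chance that all k trials
    drawn without replacement are passes.
    """
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    n, c = 10, 5  # assumed example: 10 recorded trials, 5 of them passed
    for k in (1, 2, 5):
        print(k, round(pass_at_k(n, c, k), 4), round(pass_hat_k(n, c, k), 4))
```

At k = 1 the two estimates coincide with the raw pass rate c / n. As k grows they diverge: pass@k rises toward 1 while pass^k falls toward 0, which is why pass@k suits capability checks (can the task ever be solved?) and pass^k suits reliability checks (is it solved every time?).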