01. Standardized templates for Capability and Regression evaluations
02. 9 GitHub stars
03. Automated evaluation workflows for definition, implementation, and reporting
04. Statistical reliability tracking via pass@k and pass^k metrics
05. Multiple grading modes, including deterministic code-based graders and model-based graders
06. Project-level eval storage for version-controlled testing artifacts
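
The list names pass@k and pass^k without defining them. The sketch below assumes the commonly used definitions: given n recorded trials of which c passed, pass@k estimates the chance that at least one of k sampled trials passes, and pass^k estimates the chance that all k sampled trials pass. The function names `pass_at_k` and `pass_pow_k` are hypothetical and are not taken from this project.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = total trials, c = passing trials, k = sample size (assumed meanings).
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials passes) from n trials with c passes."""
    _check(n, c, k)
    # Complement of drawing k trials that all failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials pass) from n trials with c passes."""
    _check(n, c, k)
    # Fraction of k-sized draws made up entirely of passing trials.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    n, c = 10, 5
    for k in (1, 2, 5):
        print(k, round(pass_at_k(n, c, k), 4), round(pass_pow_k(n, c, k), 4))
```

Under these definitions the two metrics coincide at k = 1 and diverge as k grows: pass@k rises toward 1 while pass^k falls toward 0. That makes pass@k a natural fit for capability evals and pass^k for regression evals, where consistency matters more than a single success.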