01 Multi-modal grading, including code-based, model-based, and human review
02 2 GitHub stars
03 Capability and regression evaluation templates for structured testing
04 Automated pass@k and pass^k reliability metric tracking
05 Standardized evaluation reporting and historical session logging
06 Native project-level storage integration for persistent benchmarking
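
The pass@k and pass^k metrics named in the list can be illustrated with a short sketch. The listing does not say how the tool computes them, so this uses the commonly cited combinatorial estimators: given n recorded trials of a task with c passes, pass@k estimates the chance that at least one of k trials passes, and pass^k estimates the chance that all k pass. The function names `pass_at_k` and `pass_hat_k` are hypothetical and are not taken from the tool.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = trials recorded, c = trials that passed, k = sample size
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k trials passes.

    Hypothetical helper; the tool's own computation is not documented here.
    """
    _check(n, c, k)
    # comb(n - c, k) is 0 when fewer than k trials failed
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated probability that all k trials pass (pass^k).

    Hypothetical helper; the tool's own computation is not documented here.
    """
    _check(n, c, k)
    # comb(c, k) is 0 when fewer than k trials passed
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    n, c = 10, 5  # illustrative numbers only, not real results
    for k in (1, 2, 5):
        print(k, round(pass_at_k(n, c, k), 4), round(pass_hat_k(n, c, k), 4))
```

With k = 1 the two metrics coincide at c/n. As k grows, pass@k rises toward 1 while pass^k falls toward 0, which is why pass@k suits capability checks and pass^k suits reliability checks.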