01 Regression testing with baseline tracking and versioning
02 0 GitHub stars
03 Standardized eval artifact storage and reporting within the project
04 Eval-Driven Development (EDD) workflow management
05 Automated pass@k and pass^k reliability metrics
06 Multi-modal grading including Code, Model-as-judge, and Human review
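
The list names pass@k and pass^k without defining them. The sketch below assumes the commonly used definitions: pass@k is the chance that at least one of k attempts passes, and pass^k is the chance that all k attempts pass, both estimated from n recorded trials of which c passed. The function names `pass_at_k` and `pass_hat_k` are hypothetical and are not taken from the project; its own implementation may differ.

```python
from math import comb


def _check_args(n: int, c: int, k: int) -> None:
    # n = trials recorded, c = trials that passed, k = attempts considered.
    if n < 1 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need n >= 1, 0 <= c <= n, and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k attempts passes.

    Counts the k-subsets of the n trials that contain no passing trial
    and subtracts their share from 1.
    """
    _check_args(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that all k attempts pass (pass^k).

    Counts the k-subsets of the n trials made up only of passing trials.
    """
    _check_args(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical run: 10 trials, 5 of which passed.
    for k in (1, 2, 5):
        print(k, round(pass_at_k(10, 5, k), 3), round(pass_hat_k(10, 5, k), 3))
```

Under these definitions the two metrics agree at k = 1 and diverge as k grows. With 5 passes in 10 trials and k = 2, pass@k is 35/45 (about 0.78) while pass^k is 10/45 (about 0.22), which is why pass^k is the stricter measure of reliability.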