01 Support for pass@k and pass^k reliability metrics
02 Formalized Eval-Driven Development (EDD) workflow
03 Standardized templates for capability and regression evaluations
04 Automated regression testing with baseline tracking
05 1 GitHub star
06 Multi-modal grading, including code-based and model-based evaluators
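
The list names pass@k and pass^k without defining them. As a minimal sketch, assuming the commonly used definitions (pass@k: at least one of k sampled attempts succeeds; pass^k: all k sampled attempts succeed), both can be estimated from n recorded trials of which c passed. The function names `pass_at_k` and `pass_hat_k` are hypothetical and are not taken from this project.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Shared argument validation: c successes out of n trials, sample size k.
    if n < 1 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n >= 1, 0 <= c <= n, 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts passes.

    Uses the combinatorial estimator 1 - C(n - c, k) / C(n, k):
    one minus the chance that k trials drawn without replacement
    from the n recorded trials are all failures.
    """
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that all k attempts pass (pass^k).

    Uses C(c, k) / C(n, k): the chance that k trials drawn without
    replacement from the n recorded trials are all successes.
    """
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical run: 10 trials of one eval task, 5 of them passed.
    n, c = 10, 5
    for k in (1, 2, 5):
        print(k, round(pass_at_k(n, c, k), 4), round(pass_hat_k(n, c, k), 4))
```

With k = 1 the two estimates coincide at the raw pass rate c / n. As k grows, pass@k rises toward 1 while pass^k falls toward 0, which is why pass^k is the stricter reliability measure when every run has to succeed.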