Automated capability and regression testing
Support for code-based, model-based, and human graders
Standardized eval reporting and historical logging
Statistical reliability tracking with pass@k and pass^k metrics
Eval-Driven Development (EDD) workflow management
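
To make the pass@k and pass^k item above concrete, here is a minimal sketch of how the two metrics are commonly estimated from repeated trials of a single task. It assumes the usual definitions (pass@k: at least one of k sampled trials passes; pass^k: all k sampled trials pass) and uses combinatorial estimators over n recorded trials with c passes. The function names `pass_at_k` and `pass_pow_k` are hypothetical and are not taken from this tool, which may compute these metrics differently.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    """Validate trial counts shared by both estimators."""
    if not (0 <= c <= n):
        raise ValueError("require 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("require 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials passes) from c passes in n trials."""
    _check(n, c, k)
    # comb(n - c, k) counts the k-subsets made up entirely of failures;
    # it is 0 when there are fewer than k failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials pass) from c passes in n trials."""
    _check(n, c, k)
    # comb(c, k) counts the k-subsets made up entirely of passes;
    # it is 0 when there are fewer than k passes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical run: 10 trials of one task, 7 of which passed.
    n_trials, n_passed = 10, 7
    for k in (1, 3, 5):
        print(k, pass_at_k(n_trials, n_passed, k), pass_pow_k(n_trials, n_passed, k))
```

With 7 passes in 10 trials and k = 3, pass@k comes out near 0.99 while pass^k is near 0.29. The gap is why the two are tracked separately: pass@k answers "can it succeed at all?", whereas pass^k answers "does it succeed every time?", which is the stricter bar for regression tracking.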