Standardized Eval-Driven Development (EDD) workflow
Multiple grading modes: code-based, model-based, and human-in-the-loop
Support for pass@k and pass^k reliability metrics
Standardized evaluation reporting and local storage formats
Automated regression testing for agent and prompt versioning
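
The two reliability metrics named in the list can be made concrete with a short sketch. It assumes each task is run n times with c passing trials, and it uses the common combinatorial estimators: pass@k is the chance that at least one of k sampled trials passes, and pass^k is the chance that all k sampled trials pass. The function names `pass_at_k` and `pass_hat_k` are hypothetical, and the workflow described here may compute these metrics differently.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = total trials, c = passing trials, k = sample size.
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k trials passes.

    Uses 1 - C(n - c, k) / C(n, k): one minus the chance that all k
    trials, drawn without replacement from the n recorded, are failures.
    """
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that all k trials pass (pass^k).

    Uses C(c, k) / C(n, k): the chance that k trials drawn without
    replacement from the n recorded are all successes.
    """
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers only: 10 trials of one task, 7 of them passed.
    print(f"pass@3 = {pass_at_k(10, 7, 3):.3f}")   # pass@3 = 0.992
    print(f"pass^3 = {pass_hat_k(10, 7, 3):.3f}")  # pass^3 = 0.292
```

The gap between the two numbers is the point of reporting both: an agent that passes 7 of 10 trials almost always succeeds at least once in three attempts, yet succeeds on all three attempts less than a third of the time.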