01 Automated capability and regression testing suites
02 Reliability tracking using pass@k and pass^k metrics
03 Standardized evaluation reporting and project-level storage
04 0 GitHub stars
05 Evaluation-Driven Development (EDD) framework implementation
06 Support for code-based, model-based, and human evaluators
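
The list names pass@k and pass^k but does not define them. As a minimal sketch, assuming the commonly used definitions (pass@k: at least one of k sampled attempts passes; pass^k: all k sampled attempts pass), both can be estimated from n recorded trials of which c passed. The function names `pass_at_k` and `pass_hat_k` are hypothetical and are not taken from the project itself.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = total trials recorded, c = trials that passed, k = sample size.
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled trials passes.

    Assumed definition: 1 - C(n - c, k) / C(n, k), i.e. one minus the
    probability that all k trials drawn without replacement are failures.
    """
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that all k sampled trials pass (pass^k).

    Assumed definition: C(c, k) / C(n, k), i.e. the probability that
    k trials drawn without replacement are all successes.
    """
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 10 recorded trials, 5 of which passed.
    for k in (1, 3):
        print(k, round(pass_at_k(10, 5, k), 4), round(pass_hat_k(10, 5, k), 4))
```

Under these assumed definitions the two metrics coincide at k = 1 and diverge as k grows: pass@k rises toward 1 and rewards occasional success, while pass^k falls toward 0 unless the task passes consistently. That makes pass^k the stricter signal for regression suites.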