01 112,914 GitHub stars
02 Reliability measurement with pass@k and pass^k metrics
03 Multi-modal grading including code-based and model-based graders
04 Standardized capability and performance benchmarking
05 Eval-Driven Development (EDD) workflow management
06 Automated regression testing for agent and prompt stability
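
The two reliability metrics named above can be illustrated with a minimal sketch. It assumes the commonly used combinatorial definitions: pass@k as the chance that at least one of k sampled trials passes, and pass^k as the chance that all k sampled trials pass, both estimated from c passing trials out of n recorded trials. The function names `pass_at_k` and `pass_hat_k` are hypothetical and chosen for illustration; the project's own implementation may differ.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = trials recorded, c = trials that passed, k = trials sampled
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled trials passes."""
    _check(n, c, k)
    # 1 minus the fraction of k-subsets that contain only failing trials.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that all k sampled trials pass (pass^k)."""
    _check(n, c, k)
    # Fraction of k-subsets that contain only passing trials.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical data: 10 recorded trials, 5 of which passed.
    for k in (1, 2, 5):
        print(k, round(pass_at_k(10, 5, k), 4), round(pass_hat_k(10, 5, k), 4))
```

With k = 1 the two metrics coincide at the raw pass rate. As k grows, pass@k rises toward 1 while pass^k falls toward 0, which is why pass^k is the stricter measure of consistency across repeated runs.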