01 Structured Eval-Driven Development (EDD) Workflow
02 Automated Eval Reporting and History Logging
03 Multi-modal Grading (Code-based, Model-based, and Human)
05 Capability and Regression Eval Framework
06 Reliability Metrics Tracking, including pass@k and pass^k
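
The list names pass@k and pass^k without defining them. The sketch below assumes the common definitions: pass@k is the probability that at least one of k attempts succeeds, and pass^k is the probability that all k attempts succeed. Both are estimated from n recorded trials of which c passed, by counting subsets of k trials. The function names `pass_at_k` and `pass_hat_k` are hypothetical and are not taken from the framework itself.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Shared argument validation: c passes out of n trials, subsets of size k.
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts passes).

    Computed as 1 minus the fraction of size-k subsets of the n trials
    that contain no passing trial.
    """
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts pass), written pass^k in the list above.

    Computed as the fraction of size-k subsets of the n trials
    that consist only of passing trials.
    """
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical eval history: 10 trials, 5 of which passed.
    print(round(pass_at_k(10, 5, 2), 3))   # 0.778
    print(round(pass_hat_k(10, 5, 2), 3))  # 0.222
```

Under these assumed definitions the two metrics coincide at k = 1 and diverge as k grows: pass@k rises toward 1, while pass^k falls toward 0 unless every trial passes. That makes pass^k the stricter measure when consistent behavior matters more than occasional success.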