01. Structured EDD workflow (Define, Implement, Evaluate, Report)
02. 1 GitHub star
03. Multi-modal grading, including code-based and model-based checks
04. Persistent evaluation storage and historical run logging
05. Capability and regression evaluation templates
06. Reliability metrics tracking with pass@k and pass^k scoring
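
The list names pass@k and pass^k but does not define them. As a rough sketch, assuming the commonly used definitions (pass@k: at least one of k sampled attempts succeeds; pass^k: all k sampled attempts succeed), both can be estimated from n recorded attempts of which c passed. The function names `pass_at_k` and `pass_pow_k` are illustrative and are not taken from the tool itself.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = attempts recorded, c = attempts that passed, k = sample size
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled attempts passes."""
    _check(n, c, k)
    # comb(n - c, k) counts the k-subsets made up only of failures;
    # it is 0 when there are fewer than k failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that all k sampled attempts pass."""
    _check(n, c, k)
    # comb(c, k) counts the k-subsets made up only of passes;
    # it is 0 when there are fewer than k passes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical run: 10 attempts, 3 passed.
    print(pass_at_k(10, 3, 5))
    print(pass_pow_k(10, 3, 2))
```

Under these assumptions pass@k rewards a task that can succeed at least once, while pass^k rewards consistency, so the two diverge sharply for flaky tasks as k grows.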