01 Support for code-based, model-based, and human-in-the-loop graders
02 Integrated workflow for defining, checking, and reporting evals
03 Reliability metrics, including pass@k and pass^k calculations
04 Standardized capability and regression eval templates
05 Persistent storage of eval history and baselines in the repository
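
The pass@k and pass^k metrics named in the list can be illustrated with a short sketch. The source does not say how the tool computes them, so the code below uses the commonly cited combinatorial estimators and should be read as an assumption: given `n` recorded trials of a task with `c` successes, pass@k estimates the chance that at least one of `k` sampled trials passes, and pass^k estimates the chance that all `k` pass. The function names `pass_at_k` and `pass_hat_k` are hypothetical and are not taken from the tool.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Shared input validation: c successes out of n trials, sampling k of them.
    if n < 1 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n >= 1, 0 <= c <= n, and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k sampled trials passes."""
    _check(n, c, k)
    # comb(n - c, k) counts the k-subsets made only of failures;
    # it is 0 when there are fewer than k failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated probability that all k sampled trials pass (pass^k)."""
    _check(n, c, k)
    # comb(c, k) counts the k-subsets made only of successes;
    # it is 0 when there are fewer than k successes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 10 trials, 3 successes.
    for k in (1, 2, 3):
        print(k, round(pass_at_k(10, 3, k), 4), round(pass_hat_k(10, 3, k), 4))
```

At k = 1 the two metrics coincide with the plain success rate. As k grows, pass@k rises (capability: can the system ever solve it?) while pass^k falls (reliability: does it solve it every time?), which is why the two are usually reported side by side.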