01 Multi-modal grading including Code-based, Model-based, and Human review
02 Structured project storage for eval definitions and run history
03 Integrated CLI commands for defining, checking, and reporting evals
04 Automated pass@k and pass^k reliability metrics to measure AI consistency
05 0 GitHub stars
06 Standardized Capability and Regression Eval templates for structured testing
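
The list does not say how the pass@k and pass^k metrics are computed, so the following is only a minimal sketch under common definitions: pass@k as the chance that at least one of k sampled trials passes, and pass^k as the chance that all k sampled trials pass, both estimated from c passing trials out of n recorded ones. The function names `pass_at_k` and `pass_hat_k` are hypothetical and are not taken from the tool.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Shared validation: c passes out of n trials, sampling k of them.
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials passes), sampling without replacement."""
    _check(n, c, k)
    # comb(n - c, k) counts the k-subsets made only of failing trials;
    # math.comb returns 0 when k > n - c, so the result is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials pass), sampling without replacement."""
    _check(n, c, k)
    # comb(c, k) counts the k-subsets made only of passing trials.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical run history: 10 trials of one eval, 5 of which passed.
    n, c = 10, 5
    for k in (1, 2, 5):
        print(f"k={k}: pass@k={pass_at_k(n, c, k):.3f}, pass^k={pass_hat_k(n, c, k):.3f}")
```

Under these definitions the two metrics coincide at k = 1 and diverge as k grows: pass@k rises toward 1 while pass^k falls toward 0, which is why pass^k is the stricter measure of consistency.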