01 Multi-modal grading that combines deterministic code checks with AI-based reviews
02 Integrated CLI workflow for defining, checking, and reporting evals
03 Standardized in-project storage for eval definitions, logs, and baselines
04 Reliability tracking with pass@k and pass^k metrics
05 Automated capability and regression eval suites for AI tasks
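
The two reliability metrics named in the list can be illustrated with a short sketch. It assumes the commonly used definitions: pass@k is the chance that at least one of k trials passes, and pass^k is the chance that all k trials pass, both estimated from n recorded trials of which c passed. The function names `pass_at_k` and `pass_hat_k` are hypothetical and are not taken from this tool's CLI or API.

```python
from math import comb


def _validate(n: int, c: int, k: int) -> None:
    # n = recorded trials, c = trials that passed, k = sample size
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k trials passes.

    Uses the combinatorial estimator 1 - C(n - c, k) / C(n, k):
    one minus the chance that k trials drawn without replacement
    from the n recorded trials are all failures.
    """
    _validate(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that all k trials pass (pass^k).

    Uses C(c, k) / C(n, k): the chance that k trials drawn without
    replacement from the n recorded trials are all passes.
    """
    _validate(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 10 recorded trials, 7 of which passed.
    n, c = 10, 7
    for k in (1, 3, 5):
        print(f"k={k}: pass@k={pass_at_k(n, c, k):.3f}  pass^k={pass_hat_k(n, c, k):.3f}")
```

Under these definitions pass@k rises with k while pass^k falls, so the first suits capability evals (can the task be solved at all) and the second suits regression evals (is it solved every time).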