01 Standardized evaluation lifecycle (Define, Implement, Evaluate, Report)
02 Automated pass@k and pass^k reliability metrics
03 Version-controlled evaluation storage in .claude/evals/
04 Multi-modal grading (Deterministic Code, LLM-as-a-Judge, and Human Review)
05 Capability and regression testing suites for AI tasks
06 156,032 GitHub stars
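
The pass@k and pass^k metrics in item 02 can be illustrated with a minimal sketch. The list does not say how the metrics are computed, so this assumes the commonly used combinatorial estimators: from n recorded trials of a task with c passes, pass@k estimates the chance that at least one of k sampled trials passes, and pass^k estimates the chance that all k pass. The function names `pass_at_k` and `pass_pow_k` are hypothetical and not taken from the tool itself.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n trials were run, c of them passed, and k trials are sampled without replacement.
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k sampled trials passes."""
    _check(n, c, k)
    # comb(n - c, k) counts the ways to draw k trials that all failed;
    # it is 0 when fewer than k trials failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimated probability that all k sampled trials pass (pass^k)."""
    _check(n, c, k)
    # comb(c, k) counts the ways to draw k trials that all passed.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 10 recorded trials, 7 passes, k = 3.
    print(f"pass@3 = {pass_at_k(10, 7, 3):.4f}")
    print(f"pass^3 = {pass_pow_k(10, 7, 3):.4f}")
```

Under these assumptions pass@k rises with k while pass^k falls, which is why the first is usually read as a capability signal and the second as a reliability signal.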