01 Automated scoring via code-based, model-based, and human graders
02 Structured eval reporting with PASS/FAIL status and success ratios
03 Reliability measurement using pass@k and pass^k metrics
04 Capability and regression eval definitions for pre-implementation planning
05 0 GitHub stars
06 Version-controlled eval storage within the .claude/evals directory
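
The list names the pass@k and pass^k reliability metrics without defining them. As a minimal sketch, assuming the commonly used combinatorial estimators (pass@k as the chance that at least one of k sampled trials passes, pass^k as the chance that all k sampled trials pass), they could be computed from n recorded trials with c passes as shown here. The function names and the sample counts are hypothetical and are not taken from the tool itself.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Guard against inputs for which the estimators are undefined.
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k trials passes.

    Assumed definition: 1 - C(n - c, k) / C(n, k), where n is the number
    of recorded trials and c the number that passed.
    """
    _check(n, c, k)
    # comb(n - c, k) is 0 when fewer than k trials failed,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated probability that all k trials pass (pass^k).

    Assumed definition: C(c, k) / C(n, k).
    """
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical example: 10 recorded trials, 5 of which passed.
    n, c = 10, 5
    for k in (1, 3):
        print(f"k={k}: pass@k={pass_at_k(n, c, k):.3f}  pass^k={pass_hat_k(n, c, k):.3f}")
```

Under these assumed definitions the two metrics coincide at k = 1 and diverge as k grows: pass@k rises toward 1, while pass^k falls toward 0 unless nearly every trial passes. That is why pass^k is the stricter measure of consistency.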