Reliability Metrics: Built-in tracking for pass@k and pass^k to statistically measure AI performance consistency.
Version-Controlled Evaluations: Stores all definitions and logs in a dedicated .claude/evals directory for project auditing.
Multi-Modal Grading: Supports deterministic Bash/Grep checks and qualitative LLM-based evaluation rubrics.
Capability Evals: Define specific success criteria for new features before implementation begins.
Regression Testing: Automated baseline comparisons to ensure new changes don't break existing functionality.
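The two reliability metrics above measure opposite things: pass@k estimates the chance that at least one of k sampled attempts succeeds, while pass^k estimates the chance that all k succeed. As a sketch of how they are conventionally computed (these are the standard estimators, not necessarily this tool's internal code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n attempts is correct,
    given c observed successes."""
    if n - c < k:
        return 1.0  # too few failures for any k-sample to be all-fail
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """pass^k: probability that ALL of k independent samples pass,
    estimated from the observed success rate c/n."""
    return (c / n) ** k

# Example: 10 attempts, 7 passed, evaluated at k=3
print(round(pass_at_k(10, 7, 3), 4))   # -> 0.9917
print(round(pass_hat_k(10, 7, 3), 4))  # -> 0.343
```

Note how pass^k is the stricter metric: a model with a 70% per-attempt success rate almost always produces one good answer in three tries, but succeeds on all three only about a third of the time.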
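A baseline comparison for regression testing can be as simple as diffing current pass rates against a stored snapshot. The sketch below is a hypothetical illustration; the file layout, function name, and tolerance threshold are assumptions, not this tool's actual format:

```python
import json
from pathlib import Path

def find_regressions(baseline_path: str, current: dict[str, float],
                     tolerance: float = 0.05) -> list[str]:
    """Return names of evals whose pass rate dropped more than
    `tolerance` below the stored baseline. (Illustrative only:
    assumes a flat JSON file mapping eval name -> pass rate.)"""
    baseline = json.loads(Path(baseline_path).read_text())
    return [name for name, rate in current.items()
            if name in baseline and baseline[name] - rate > tolerance]
```

A CI step would store the baseline file after each accepted change and fail the build whenever `find_regressions` returns a non-empty list.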