Automates the generation of comprehensive evaluation reports and logs
Implements Eval-Driven Development (EDD) for structured AI coding
Calculates pass@k and pass^k reliability metrics for performance tracking
Supports Capability and Regression evaluations to track progress
Combines deterministic code-based graders with LLM-based qualitative grading
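
As a rough illustration of the two reliability metrics named above, here is a minimal Python sketch. It assumes the commonly used definitions, which this listing does not spell out: pass@k is the chance that at least one of k sampled attempts passes, and pass^k is the chance that all k sampled attempts pass. Both are estimated from n recorded attempts, of which c passed. The function names `pass_at_k` and `pass_hat_k` are hypothetical and are not taken from the tool itself.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = attempts recorded, c = attempts that passed, k = sample size
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled attempts passes)."""
    _check(n, c, k)
    # comb(n - c, k) counts the k-subsets made only of failures;
    # it is 0 when there are fewer than k failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled attempts pass), i.e. pass^k."""
    _check(n, c, k)
    # comb(c, k) counts the k-subsets made only of passes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical eval: 10 attempts, 5 of which passed.
    for k in (1, 2, 3):
        print(k, round(pass_at_k(10, 5, k), 4), round(pass_hat_k(10, 5, k), 4))
```

Under these assumed definitions, pass@k rises with k while pass^k falls, so pass@k suits capability evaluations (can the task be solved at all?) and pass^k suits regression evaluations (is it solved every time?).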