01 0 GitHub stars
02 Provides capability and regression evaluation templates for systematic testing.
03 Implements Evaluation-Driven Development (EDD) principles for AI workflows.
04 Supports deterministic code-based, model-based, and human-review graders.
05 Tracks reliability metrics, including pass@k (at least one success in k attempts) and pass^k (all k attempts succeed, a measure of stability).
06 Generates detailed evaluation reports for auditing AI task completion.
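
The two reliability metrics in item 05 can be made concrete with a small sketch. It assumes each task was run several times and each run was graded pass or fail. The function names `pass_at_k` and `pass_hat_k`, the task names, and the results layout are hypothetical illustrations, not this tool's actual API.

```python
from typing import Dict, List


def _check(results: Dict[str, List[bool]], k: int) -> None:
    """Validate that every task has at least k recorded trials."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not results:
        raise ValueError("results must contain at least one task")
    for task, trials in results.items():
        if len(trials) < k:
            raise ValueError(f"task {task!r} has fewer than {k} trials")


def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks with at least one success in the first k trials."""
    _check(results, k)
    return sum(any(trials[:k]) for trials in results.values()) / len(results)


def pass_hat_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks where all of the first k trials succeed (pass^k)."""
    _check(results, k)
    return sum(all(trials[:k]) for trials in results.values()) / len(results)


# Hypothetical graded outcomes: task name -> one boolean per trial.
results = {
    "parse_config": [True, True, True],
    "fix_bug": [False, True, False],
    "write_docs": [False, False, False],
}

capability = pass_at_k(results, k=3)   # two of three tasks succeed at least once
stability = pass_hat_k(results, k=3)   # one of three tasks succeeds every time
print(f"pass@3 = {capability:.2f}, pass^3 = {stability:.2f}")
```

A large gap between the two numbers suggests the workflow can complete a task but does not do so consistently, which is the kind of signal a regression evaluation is meant to surface.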