01 0 GitHub stars
02 Automated transcript capture for evaluating multi-turn agent trajectories and tool calls
03 Failure-to-task pipeline that converts real-world edge cases into regression tests
04 Pre-configured domain patterns for coding, research, and conversational agents
05 Statistical performance metrics including pass@k for capability and pass^k for reliability
06 Multi-modal grading including code-based, LLM-based, and human-in-the-loop reviewers
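
The two metrics named above can be illustrated with a short sketch. The list does not say how the tool computes them, so this assumes the commonly used combinatorial estimators: given n recorded trials of a task with c passes, pass@k estimates the chance that at least one of k sampled trials passes (capability), and pass^k estimates the chance that all k sampled trials pass (reliability). The function names `pass_at_k` and `pass_hat_k` are hypothetical and not taken from the tool.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Shared argument validation for both estimators.
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Capability: estimated P(at least one of k sampled trials passes).

    Assumed estimator: 1 - C(n - c, k) / C(n, k), where n is the number of
    recorded trials and c the number that passed.
    """
    _check(n, c, k)
    # comb(n - c, k) is 0 when fewer than k trials failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Reliability: estimated P(all k sampled trials pass).

    Assumed estimator: C(c, k) / C(n, k).
    """
    _check(n, c, k)
    # comb(c, k) is 0 when fewer than k trials passed, giving 0.0.
    return comb(c, k) / comb(n, k)
```

With 7 passes in 10 trials and k = 3, `pass_at_k(10, 7, 3)` is about 0.992 while `pass_hat_k(10, 7, 3)` is about 0.292. The gap is why it helps to report both: an agent that usually succeeds within a few attempts can still be unreliable when every attempt has to succeed.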