How does this skill help with regression testing?

It provides a 'Failure-to-Task' utility that logs real-world agent failures and automatically converts them into test tasks, ensuring that once a bug is fixed, it stays fixed.
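As a rough illustration of the failure-to-task idea, the sketch below turns one logged failure into a regression task. The `FailureRecord` and `EvalTask` shapes, the `failure_to_task` helper, and the substring check are all hypothetical and are not taken from the skill itself.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureRecord:
    """A logged real-world failure (hypothetical shape)."""
    prompt: str    # what the agent was asked to do
    observed: str  # what the agent actually produced
    expected: str  # text a correct answer should contain


@dataclass
class EvalTask:
    """A regression task derived from a failure (hypothetical shape)."""
    task_id: str
    prompt: str
    must_contain: str
    tags: List[str] = field(default_factory=lambda: ["regression"])


def failure_to_task(failure: FailureRecord) -> EvalTask:
    # Derive a stable ID from the prompt so the same failure maps to one task.
    digest = hashlib.sha256(failure.prompt.encode("utf-8")).hexdigest()[:8]
    return EvalTask(
        task_id=f"regress-{digest}",
        prompt=failure.prompt,
        must_contain=failure.expected,
    )


def passes(task: EvalTask, agent_output: str) -> bool:
    # Deliberately simple check: the expected text must appear in the output.
    return task.must_contain in agent_output
```

Once the fix lands, rerunning the agent on `task.prompt` and calling `passes` on its output gives a repeatable check for that specific failure.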

Does it support domain-specific testing?

Yes, it includes pre-configured grader stacks for various domains such as coding, research, conversational AI, and computer use.

What are the three types of graders available?

The skill supports Code-based graders for deterministic checks, Model-based (LLM) graders for nuanced quality rubrics, and Human graders for establishing gold-standard benchmarks.
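To make the code-based category concrete, here is a minimal sketch of deterministic checks run over a captured transcript. The transcript shape (a list of steps with `tool` and `output` keys) and the helper names are assumptions for illustration, not the skill's actual API.

```python
from typing import Callable, Dict, List

# Assumed transcript shape: one dict per agent step.
Transcript = List[Dict[str, str]]
Check = Callable[[Transcript], bool]


def used_tool(name: str) -> Check:
    """Build a check that passes if the agent called the named tool."""
    def check(transcript: Transcript) -> bool:
        return any(step.get("tool") == name for step in transcript)
    return check


def final_output_contains(text: str) -> Check:
    """Build a check that passes if the last step's output contains text."""
    def check(transcript: Transcript) -> bool:
        return bool(transcript) and text in transcript[-1].get("output", "")
    return check


def grade(transcript: Transcript, checks: List[Check]) -> bool:
    # A code-based grade is deterministic: every check must pass.
    return all(check(transcript) for check in checks)


transcript = [
    {"tool": "read_file", "output": "def foo(): ..."},
    {"tool": "run_tests", "output": "3 passed"},
]
result = grade(transcript, [used_tool("run_tests"), final_output_contains("passed")])
```

Model-based and human graders would sit behind the same kind of interface but return rubric scores rather than a strict boolean.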

What metrics does the framework use to measure success?

It utilizes pass@k (the probability of at least one success in k trials) to measure capability and pass^k (the probability that all trials succeed) to measure consistency.
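The two metrics can be sketched from their definitions. The functions below assume trials are independent and share a single per-trial success probability `p`, estimated here as the observed success rate; the function names are illustrative only.

```python
from typing import Sequence


def success_rate(results: Sequence[bool]) -> float:
    """Fraction of trials that succeeded (estimate of per-trial p)."""
    if not results:
        raise ValueError("need at least one trial")
    return sum(results) / len(results)


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


trials = [True] * 8 + [False] * 2    # 8 successes out of 10 trials
p = success_rate(trials)             # 0.8
capability = pass_at_k(p, 3)         # round(capability, 3) == 0.992
consistency = pass_hat_k(p, 3)       # round(consistency, 3) == 0.512
```

The same agent looks strong on capability (0.992) and much weaker on consistency (0.512), which is why the two metrics are reported together.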

What is the Evals skill used for in Claude Code?

It is a specialized framework designed to test, benchmark, and verify the behavior of AI agents, particularly across complex, multi-turn workflows rather than just single responses.

Agent Evaluation Framework

Name: Agent Evaluation Framework
Author: verrio1


Security & Testing

Evaluates AI agent workflows and performance using a comprehensive framework based on Anthropic's best practices.

The Evals skill provides an environment for benchmarking AI agents by capturing complete interaction transcripts and applying several grading methods. It lets developers go beyond single-output testing and evaluate complex multi-turn tool usage through code-based assertions, model-based rubrics, and human review. With pass@k and pass^k metrics, teams can quantify both agent capability and consistency, and the skill integrates with regression suites and THE ALGORITHM for verifiable development cycles.

Key Features

01 0 GitHub stars

02 Automated transcript capture for evaluating multi-turn agent trajectories and tool calls

03 Failure-to-task pipeline that converts real-world edge cases into regression tests

04 Pre-configured domain patterns for coding, research, and conversational agents

05 Statistical performance metrics, including pass@k for capability and pass^k for reliability

06 Multiple grading methods: code-based, LLM-based, and human-in-the-loop review

Use Cases

01 Calibrating automated LLM judges against human expert judgment for complex reasoning tasks

02 Benchmarking new agent capabilities against a library of real-world failure cases

03 Establishing high-confidence regression gates (99% pass rate) for production deployments
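A regression gate like the one in the last use case can be sketched as a threshold check over a suite's results. The `regression_gate` name and its signature are hypothetical; only the 99% threshold comes from the listing.

```python
from typing import Sequence


def regression_gate(results: Sequence[bool], threshold: float = 0.99) -> bool:
    """Return True if the observed pass rate meets the threshold."""
    if not results:
        # An empty suite should never be treated as a passing gate.
        raise ValueError("regression suite produced no results")
    pass_rate = sum(results) / len(results)
    return pass_rate >= threshold


suite = [True] * 99 + [False]        # 99 of 100 regression tasks pass
deploy_allowed = regression_gate(suite)
```

In a deployment pipeline, a `False` result would block the release until the failing regression tasks are fixed.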


Install with 🐟 Skill.Fish

npx skillfish add verrio1/vaughn-pai Evals
