1. Standardized evaluation templates for coding and conversational agents
2. Integration guidance for frameworks such as Harbor, Promptfoo, and Braintrust
3. Reliability metrics, including pass@k and pass^k, for tracking non-deterministic outputs
4. Multi-modal grading systems combining code-based, model-based, and human review logic
5. A structured roadmap for building evaluation harnesses from scratch
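As a rough sketch of the two reliability metrics named above: pass@k is commonly computed with the unbiased combinatorial estimator (at least one of k sampled attempts succeeds), while pass^k assumes k independent trials must all succeed. The function names below are illustrative, not from any particular library.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n attempts, c of which
    are correct, is a passing attempt."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(p: float, k: int) -> float:
    """pass^k: probability that all k independent trials succeed,
    given a per-trial success rate p."""
    return p ** k
```

Note the opposite behavior as k grows: pass@k rises (more chances to get one success), while pass^k falls (every trial must succeed), which is why pass^k is the stricter metric for agent reliability.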