Overview
This skill enables Claude to build evaluation suites that move beyond vanity metrics to provide actionable insight into AI performance. It covers the full spectrum of evaluation strategies, including deterministic code-based graders, model-based scoring, and human-in-the-loop patterns. By implementing metrics such as pass@k, F1 scores, and iterative recovery rates (the Ralph Pattern), users can rigorously validate agent behavior across coding, research, and computer-use tasks. Whether you are benchmarking MCP servers or optimizing multi-agent coordination, this skill provides a roadmap for building robust, balanced test sets that drive iterative improvement.
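Of the metrics listed above, pass@k is the one most often implemented incorrectly. The standard unbiased estimator (introduced with the Codex evaluation work) computes, from n total samples of which c passed, the probability that at least one of k randomly drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated per task
    c: number of those samples that passed the grader
    k: budget of attempts being evaluated
    """
    if n - c < k:
        # Fewer failures than draws: at least one draw must be a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per task, 3 passed the grader.
print(pass_at_k(10, 3, 1))  # ≈ 0.3 (matches the raw pass rate at k=1)
print(pass_at_k(10, 3, 5))  # higher: 5 attempts give more chances to pass
```

Note the naive alternative, 1 - (1 - c/n)**k, is biased for small n; the combinatorial form above is the one to use when reporting results.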