Overview
The Agent Evaluation skill provides a robust framework for assessing the effectiveness of AI agents and Claude Code commands. It addresses the inherent challenges of non-deterministic AI behavior by shifting the focus from rigid execution paths to outcome-based metrics and process quality. By applying multi-dimensional rubrics that cover instruction following, completeness, reasoning quality, and tool efficiency, developers can systematically identify performance bottlenecks. The skill also incorporates LLM-as-judge techniques with specific strategies to mitigate common judging biases, such as position bias and length bias, ensuring that context engineering choices are validated against realistic task complexity levels and token constraints.
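To make these mechanics concrete, below is a minimal Python sketch of the two core ideas: per-dimension rubric scoring and pairwise LLM-as-judge comparison that mitigates position bias by judging both orderings. The `call_judge` helper, the 1-5 scale, and the agreement rule are illustrative assumptions, not the skill's actual API.

```python
# Minimal sketch of rubric scoring and position-bias mitigation.
# `call_judge` is a hypothetical stand-in for the project's LLM client.
from dataclasses import dataclass, field
from statistics import mean

# Rubric dimensions named in the overview above.
RUBRIC_DIMENSIONS = [
    "instruction_following",
    "completeness",
    "reasoning_quality",
    "tool_efficiency",
]


@dataclass
class RubricScore:
    """Per-dimension scores on an assumed 1-5 scale."""
    scores: dict[str, float] = field(default_factory=dict)

    @property
    def overall(self) -> float:
        # Simple unweighted mean; assumes at least one dimension was scored.
        return mean(self.scores.values())


def call_judge(prompt: str) -> str:
    """Hypothetical LLM call; replace with your client of choice."""
    raise NotImplementedError


def score_with_rubric(task: str, output: str) -> RubricScore:
    """Ask the judge to rate the output on each rubric dimension."""
    scores = {}
    for dim in RUBRIC_DIMENSIONS:
        reply = call_judge(
            f"Task: {task}\n\nAgent output:\n{output}\n\n"
            f"Rate the output's {dim.replace('_', ' ')} from 1 (poor) "
            "to 5 (excellent). Answer with a single integer."
        )
        scores[dim] = float(reply.strip())
    return RubricScore(scores=scores)


def judge_pairwise(task: str, output_a: str, output_b: str) -> str:
    """Compare two outputs, judging both orderings so a preference for
    whichever response appears first (position bias) cancels out."""
    template = (
        "Task: {task}\n\nResponse 1:\n{first}\n\nResponse 2:\n{second}\n\n"
        "Which response is better? Answer '1' or '2'."
    )
    verdict_ab = call_judge(template.format(task=task, first=output_a, second=output_b))
    verdict_ba = call_judge(template.format(task=task, first=output_b, second=output_a))
    # A wins only if it is preferred in both orderings; likewise for B.
    if verdict_ab.strip() == "1" and verdict_ba.strip() == "2":
        return "A"
    if verdict_ab.strip() == "2" and verdict_ba.strip() == "1":
        return "B"
    # Disagreement across orderings suggests position bias; report a tie.
    return "tie"
```

Length bias can be handled in the same spirit, for example by instructing the judge to ignore verbosity or by normalizing scores against response length before aggregation.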