Overview
The Evaluation skill provides a systematic approach to assessing complex agent systems where traditional software testing often falls short. It addresses the unique challenges of non-determinism and context-dependent failures by offering outcome-focused methodologies, including multi-dimensional scoring rubrics and LLM-as-judge automation. By scoring factors such as factual accuracy, tool efficiency, and complexity stratification, this skill enables developers to build quality gates, validate context engineering strategies, and implement continuous evaluation pipelines. The result is that agents maintain high standards of reliability and efficiency throughout their lifecycle.
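The multi-dimensional rubric with LLM-as-judge scoring described above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the dimension names and weights are assumptions drawn from the factors listed in the text, and `judge_dimension` is a stub standing in for a real LLM-as-judge call.

```python
# Hypothetical rubric dimensions and weights, based on the factors the
# text names (factual accuracy, tool efficiency); illustrative only.
RUBRIC = {
    "factual_accuracy": 0.5,
    "tool_efficiency": 0.3,
    "task_completion": 0.2,
}

def judge_dimension(transcript: str, dimension: str) -> float:
    """Stand-in for an LLM-as-judge call. In a real pipeline this would
    prompt a judge model to score the transcript from 0.0 to 1.0 on one
    rubric dimension; here a toy keyword check keeps the sketch runnable."""
    keywords = {
        "factual_accuracy": "correct",
        "tool_efficiency": "single tool call",
        "task_completion": "done",
    }
    return 1.0 if keywords[dimension] in transcript else 0.0

def score_transcript(transcript: str) -> float:
    """Combine per-dimension judge scores into one weighted score.
    A quality gate can then compare this value against a threshold."""
    return sum(w * judge_dimension(transcript, d) for d, w in RUBRIC.items())

transcript = "Agent gave correct facts via a single tool call; task done."
overall = score_transcript(transcript)
print(round(overall, 2))
```

In a continuous evaluation pipeline, `score_transcript` would run over every new agent transcript, with the weighted score gating deployment when it falls below an agreed threshold.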