What metrics does the evaluation rubric track?

The core rubric tracks four primary dimensions: factual accuracy, completeness, citation accuracy, and tool efficiency, scored on a 0 to 1 scale.
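As a rough illustration, the four dimensions could be combined into one score like this. The dimension names come from the answer above; the weights and the `overall_score` helper are hypothetical and not taken from the skill itself.

```python
# Hypothetical weights; the skill's real weighting (if any) is not documented here.
WEIGHTS = {
    "factual_accuracy": 0.4,
    "completeness": 0.3,
    "citation_accuracy": 0.2,
    "tool_efficiency": 0.1,
}


def overall_score(scores: dict) -> float:
    """Return the weighted mean of per-dimension scores, each in [0, 1]."""
    for name in WEIGHTS:
        value = scores[name]  # KeyError if a dimension is missing
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


example = {
    "factual_accuracy": 1.0,
    "completeness": 0.5,
    "citation_accuracy": 1.0,
    "tool_efficiency": 0.0,
}
print(round(overall_score(example), 2))
```

Keeping the per-dimension scores alongside the aggregate is usually worthwhile, since a single number hides which dimension dropped.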

How does this skill help manage token costs?

It provides frameworks for measuring how token usage correlates with performance variance, helping you determine whether a model upgrade is more efficient than raising token limits.
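One simple way to quantify that relationship is a Pearson correlation between tokens used per run and the score each run received. The sketch below is an assumption about how such a measurement might look, not the skill's own implementation; `pearson` and `variance_explained` are illustrative names.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def variance_explained(tokens, scores):
    """Share of score variance a linear fit on token count accounts for (r squared)."""
    return pearson(tokens, scores) ** 2
```

If `variance_explained` stays low across runs, extra tokens are not buying much quality, which is the situation where comparing against a stronger model becomes more interesting than raising limits.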

What is 'LLM-as-judge' in agent evaluation?

It is a pattern where a highly capable LLM is used to automatically evaluate the outputs of another agent based on specific rubrics like factual accuracy and tool efficiency.
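The pattern can be sketched without committing to any particular provider. In the example below, `call_model` is a hypothetical callable that sends a prompt to the judge model and returns its raw text reply; the prompt wording and the JSON reply format are assumptions made for illustration.

```python
import json
from typing import Callable, Dict

# Two of the rubric dimensions named in the answer above.
DIMENSIONS = ("factual_accuracy", "tool_efficiency")


def build_judge_prompt(task: str, answer: str) -> str:
    """Assemble a grading prompt that asks for one score per dimension."""
    keys = ", ".join(DIMENSIONS)
    return (
        "You are grading an AI agent's answer.\n"
        "Task: " + task + "\n"
        "Answer: " + answer + "\n"
        "Reply with a JSON object containing the keys " + keys + ", "
        "each a number between 0 and 1."
    )


def judge(task: str, answer: str, call_model: Callable[[str], str]) -> Dict[str, float]:
    """Ask the judge model for scores and clamp each one into [0, 1]."""
    reply = json.loads(call_model(build_judge_prompt(task, answer)))
    return {d: min(1.0, max(0.0, float(reply[d]))) for d in DIMENSIONS}


def fake_model(prompt: str) -> str:
    """Stand-in for a real judge model, used only to exercise the code path."""
    return '{"factual_accuracy": 0.9, "tool_efficiency": 1.4}'


print(judge("What is 2 + 2?", "4", fake_model))
```

Passing the model call in as a parameter keeps the judging logic testable offline; a real judge would also need handling for replies that are not valid JSON.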

Can this be used for regression testing AI agents?

Yes, the skill includes templates for continuous evaluation pipelines that track quality metrics over time to catch performance regressions after code or model changes.
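A minimal regression check, assuming each evaluation run produces a mapping from metric name to a 0 to 1 score, might compare the current run against a stored baseline. The `find_regressions` helper and the default tolerance are hypothetical and are not the skill's actual template.

```python
def find_regressions(baseline, current, tolerance=0.05):
    """Return metrics whose score dropped by more than `tolerance`.

    Maps each regressed metric name to a (baseline_score, current_score) pair.
    Metrics missing from `current` are skipped rather than flagged.
    """
    return {
        name: (baseline[name], current[name])
        for name in baseline
        if name in current and baseline[name] - current[name] > tolerance
    }


baseline = {"factual_accuracy": 0.9, "completeness": 0.8}
current = {"factual_accuracy": 0.8, "completeness": 0.79}
print(find_regressions(baseline, current))
```

In a CI job, a non-empty result would typically fail the build; the tolerance absorbs the run-to-run noise expected from non-deterministic agents.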

Why use complexity stratification for testing?

It allows you to categorize tests from 'simple' (single tool call) to 'very complex' (extended reasoning), helping identify exactly where an agent's logic begins to fail.
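To see where failures begin, pass rates can be grouped by tier. Only the 'simple' and 'very complex' tiers are named above; the intermediate tier names and the `pass_rate_by_tier` helper below are assumptions for illustration.

```python
from collections import defaultdict

# "simple" and "very_complex" follow the description; the middle tiers are assumed.
TIERS = ("simple", "medium", "complex", "very_complex")


def pass_rate_by_tier(results):
    """Compute the pass rate per tier from (tier, passed) pairs.

    Tiers with no test cases are omitted from the result.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for tier, ok in results:
        if tier not in TIERS:
            raise ValueError(f"unknown tier: {tier}")
        totals[tier] += 1
        passed[tier] += int(ok)
    return {tier: passed[tier] / totals[tier] for tier in TIERS if totals[tier]}


results = [
    ("simple", True),
    ("simple", True),
    ("very_complex", True),
    ("very_complex", False),
]
print(pass_rate_by_tier(results))
```

A sharp drop between two adjacent tiers points to the level of task complexity at which the agent starts to break down.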

Agent Evaluation Framework

Name: Agent Evaluation Framework
Author: eyadsibai


Security & Testing

Evaluates AI agent performance using multi-dimensional rubrics, LLM-as-judge patterns, and complexity-stratified test sets.

The agent-evaluation skill provides a framework for assessing AI agents, whose non-deterministic behavior puts them beyond the reach of traditional software testing methods. It offers standardized methodologies for measuring factual accuracy, tool efficiency, and citation correctness through LLM-as-judge architectures and complexity-stratified test sets. Through systematic context engineering evaluation and continuous monitoring patterns, the skill helps developers identify performance bottlenecks, optimize token usage, and establish reliable quality benchmarks for LLM-powered systems.

Key Features

01 Complexity stratification to test simple vs. deep reasoning tasks

02 Multi-dimensional rubrics for scoring accuracy, completeness, and efficiency

03 Context engineering evaluation to identify performance degradation cliffs

04 Continuous evaluation pipeline templates for regression tracking

05 LLM-as-judge implementation patterns for automated quality assessment

06 0 GitHub stars

Use Cases

01 Monitoring agent quality drops during continuous integration/deployment

02 Building automated test frameworks for complex AI agents

03 Benchmarking different LLMs for specific tool-calling tasks


Install with 🐟 Skill.Fish

npx skillfish add eyadsibai/ltk agent-evaluation
