Can I use this for testing terminal-based agents?

Yes. The skill is designed for agentic terminal user interface (TUI) tools and similar agent-driven workflows.

How does this skill handle non-deterministic AI outputs?

Requirements tagged [L] are validated by an LLM acting as judge. The judge assesses the output against specific, objective rubrics in cases where code-based checks are insufficient.
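
As a rough illustration of the LLM-as-judge idea, the sketch below asks a judge one objective PASS/FAIL question per rubric criterion. It is not the skill's own implementation: `Criterion`, `judge_output`, and `fake_judge` are hypothetical names, and `fake_judge` is a keyword-matching stand-in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    """One rubric line (hypothetical structure, not the rubric.md format)."""
    name: str
    question: str  # should be answerable with a plain PASS or FAIL


def judge_output(
    output: str,
    rubric: list[Criterion],
    ask_judge: Callable[[str], str],
) -> dict[str, bool]:
    """Ask the judge one PASS/FAIL question per criterion and collect verdicts."""
    results: dict[str, bool] = {}
    for criterion in rubric:
        prompt = (
            "Answer only PASS or FAIL.\n"
            f"Criterion: {criterion.question}\n"
            f"Output:\n{output}"
        )
        verdict = ask_judge(prompt).strip().upper()
        results[criterion.name] = verdict == "PASS"
    return results


def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call: passes if the agent output says 'because'."""
    agent_output = prompt.split("Output:\n", 1)[1]
    return "PASS" if "because" in agent_output else "FAIL"


rubric = [
    Criterion(
        name="states_reason",
        question="Does the trace state why the command was chosen?",
    )
]

good = judge_output("I ran ls because I needed the file list.", rubric, fake_judge)
bad = judge_output("I ran ls.", rubric, fake_judge)
```

Keeping each criterion to a single yes/no question is one way to make verdicts easier to reproduce across judge runs; in practice `ask_judge` would wrap whichever model API the project uses.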

What is the difference between [C] and [G] validation types?

[C] represents deterministic code-based validation logic, while [G] indicates validation against a specific ground truth output.
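
To make the distinction concrete, here is a minimal sketch of the two styles of check. The function names and the example rule (valid JSON with an integer `exit_code` field) are assumptions chosen for illustration and do not come from the skill's templates.

```python
import json


def check_code_based(output: str) -> bool:
    """[C]-style check: a deterministic rule evaluated in code.

    Illustrative rule: the output must be a JSON object with an
    integer "exit_code" field.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and isinstance(data.get("exit_code"), int)


def check_ground_truth(output: str, expected: str) -> bool:
    """[G]-style check: compare against a known-correct output.

    Whitespace is normalized so trivial formatting differences
    do not cause a failure.
    """
    return " ".join(output.split()) == " ".join(expected.split())
```

A [C] check needs no reference answer, only a rule, while a [G] check needs a stored expected output for each test case.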

What is Spec-Test-Driven Development (STDD)?

STDD is a development methodology where specifications and evaluation tests are defined upfront to drive the creation and refinement of agentic tools.

Does this skill help with reasoning trace analysis?

Yes. The rubric.md template provides a structure for writing criteria that evaluate an agent's reasoning process and decision-making logic.

Agent Evaluation Suite

Author: craigtkhill


Security & Testing

Standardizes the creation and maintenance of evaluation suites for AI agents using structured rubrics and validation patterns.

The Evaluation Skill provides a framework for building and managing evaluation suites for agentic tools, following a Spec-Test-Driven Development (STDD) process. Developers define success criteria in standardized spec.md and rubric.md files, which supports a hybrid validation approach: deterministic code-based checks combined with qualitative LLM-as-judge assessments. The skill helps ensure that AI agents meet domain-specific requirements, produce high-quality reasoning traces, and remain reliable throughout the development lifecycle.

Key Features

01 Structured requirement identification using unique REQ-EVAL naming conventions

02 Objective pass/fail criteria definitions for qualitative reasoning assessments

03 Hierarchical categorization of evaluation requirements for complex agent behaviors

04 Standardized templates for evaluation specifications and reasoning rubrics

05 Support for Ground Truth, Code-based, and LLM-as-judge validation types

Use Cases

01 Refining reasoning trace rubrics to improve the accuracy of LLM-as-judge assessments

02 Migrating ad-hoc testing processes to a standardized Spec-Test-Driven Development framework

03 Building a new evaluation suite to benchmark the performance of a custom AI agent


Install with 🐟 Skill.Fish

npx skillfish add craigtkhill/stdd-agents evaluation
