What files does this skill produce for my project?

It generates a specialized tests directory containing an eval.sh execution harness and a golden_examples.yaml file for defining your test scenarios.

What makes this different from traditional unit testing?

Unlike unit tests, which check code-level logic, this skill evaluates the behavioral decisions of an AI agent: an LLM acts as judge and decides whether each response meets specific qualitative criteria.
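To make the LLM-as-judge idea concrete, here is a minimal Python sketch of one judging step. It is not the skill's actual harness (that is the generated eval.sh); the prompt wording, the PASS/FAIL convention, and the names build_judge_prompt, parse_verdict, and judge are all assumptions made for illustration. The judge model is passed in as a plain function so the sketch runs without any backend.

```python
from typing import Callable


def build_judge_prompt(query: str, response: str, criteria: str) -> str:
    """Assemble the text sent to the judge model (wording is hypothetical)."""
    return (
        "You are grading an AI agent's response.\n"
        f"User query: {query}\n"
        f"Agent response: {response}\n"
        f"Success criteria: {criteria}\n"
        "Answer PASS or FAIL on the first line, then give a one-line reason."
    )


def parse_verdict(judge_output: str) -> bool:
    """Return True only if the first non-empty line starts with PASS."""
    for line in judge_output.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped.upper().startswith("PASS")
    return False  # empty output counts as a failure


def judge(query: str, response: str, criteria: str,
          call_llm: Callable[[str], str]) -> bool:
    """Run one judging step; call_llm stands in for whichever backend is used."""
    return parse_verdict(call_llm(build_judge_prompt(query, response, criteria)))
```

Treating anything other than an explicit PASS as a failure is a deliberately conservative choice: a judge reply that cannot be parsed should not count as a passing test.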

Can I use different LLM models for the judge evaluation?

Yes, the generated evaluation harness is configurable via environment variables to use different backends, including Claude and OpenCode.
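As an illustration of environment-based backend selection, the Python sketch below resolves a backend name from the environment. The variable name EVAL_BACKEND, the default of "claude", and the function resolve_backend are assumptions; the names the generated harness actually reads may differ, so check the generated eval.sh.

```python
import os
from typing import Mapping, Optional

# Backends named in the listing; the lowercase identifiers are an assumption.
SUPPORTED_BACKENDS = ("claude", "opencode")


def resolve_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the judge backend from EVAL_BACKEND (hypothetical variable name)."""
    env = os.environ if env is None else env
    name = env.get("EVAL_BACKEND", "claude").strip().lower()
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"unsupported backend {name!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    return name
```

Failing fast on an unknown value keeps a typo in the environment from silently falling back to a different model than the one you meant to benchmark.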

How do I handle non-deterministic or flaky test results?

The skill includes a troubleshooting guide that helps you tighten the wording of your SKILL.md, refine your test queries, or adjust the judge's success criteria until failures become rare.
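One way to tell a flaky test from a consistently failing one is to repeat it and look at the pass rate. The Python sketch below assumes that approach; it is not taken from the skill's troubleshooting guide, and pass_rate, classify, and the default of five attempts are hypothetical.

```python
from typing import Callable


def pass_rate(run_once: Callable[[], bool], attempts: int = 5) -> float:
    """Run the same test several times and return the fraction that passed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    passes = sum(1 for _ in range(attempts) if run_once())
    return passes / attempts


def classify(rate: float) -> str:
    """Label a test by its pass rate across repeated runs."""
    if rate == 1.0:
        return "stable pass"
    if rate == 0.0:
        return "stable fail"
    return "flaky"
```

A "stable fail" usually points at the skill instructions or the expected behavior, while a "flaky" result is more likely to call for tighter wording or clearer judge criteria.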

What are 'golden examples' in this context?

Golden examples are curated test cases consisting of a specific context, a user query, and an expected behavioral outcome that serves as the ground truth for the agent.
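The three parts of a golden example can be pictured as a small record. The Python sketch below is illustrative only: the field names name, context, query, and expected_behavior are assumptions and may not match the schema of the generated golden_examples.yaml. The input is a plain dict, as a YAML loader would typically produce.

```python
from dataclasses import dataclass

# Hypothetical field names; the generated YAML schema may use different keys.
REQUIRED_FIELDS = ("name", "context", "query", "expected_behavior")


@dataclass(frozen=True)
class GoldenExample:
    name: str               # short identifier for the test case
    context: str            # situation the agent is placed in
    query: str              # what the user asks
    expected_behavior: str  # ground truth the judge compares against


def from_mapping(raw: dict) -> GoldenExample:
    """Validate one parsed entry and convert it to a GoldenExample."""
    missing = [key for key in REQUIRED_FIELDS if not str(raw.get(key, "")).strip()]
    if missing:
        raise ValueError(f"golden example is missing: {', '.join(missing)}")
    return GoldenExample(**{key: str(raw[key]).strip() for key in REQUIRED_FIELDS})
```

Rejecting entries with an empty expected_behavior matters most, because without it the judge has nothing to grade the response against.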

Agent Skill Behavioral Testing

Author: jrollin

Security and Testing

Generates LLM-as-judge behavioral evaluation harnesses and golden examples to validate AI agent skill reliability.

This skill automates the creation of evaluation frameworks for Claude Code skills, using an LLM-as-judge approach to check that agent behavior is consistent and reliable. It generates a dedicated testing directory containing evaluation scripts and YAML-based golden examples, so developers can benchmark agent responses against specific behavioral categories, command-routing logic, and edge cases. Because it focuses on behavioral outcomes rather than traditional unit tests, it supports iterating on skill instructions and verifying that the agent respects its intended guardrails and avoids organizational anti-patterns.

Key Features

01 Automated generation of evaluation harnesses and test scenario templates
02 0 GitHub stars
03 Multi-backend support including Claude and OpenCode execution
04 Integrated troubleshooting workflow for refining skill prompts and instructions
05 LLM-as-judge verification for high-level behavioral pass/fail assessment
06 Standardized YAML schema for managing golden example test cases

Use Cases

01 Validating that a new agent skill correctly handles specific edge cases and anti-patterns
02 Benchmarking skill performance and accuracy across different LLM models
03 Regression testing complex command-routing skills during prompt engineering iterations

Install with 🐟 Skill.Fish

npx skillfish add jrollin/claudio skill-testing

For use in Claude.ai and ChatGPT