Where are the evaluation results stored?

Results are output to the console and automatically appended to the target skill's README.md in a standardized table, creating a permanent historical record of performance.
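
As a rough illustration of the "append to README" step, here is a minimal sketch that works on the README text as a string. The column layout (`Date`, `Model`, `Scenario`, `Score`) and the function name `append_result_row` are assumptions made for this example; the skill's actual table format is not described here.

```python
# Hypothetical table layout; the real skill may use different columns.
HEADER = "| Date | Model | Scenario | Score |"
DIVIDER = "| --- | --- | --- | --- |"


def append_result_row(readme: str, date: str, model: str, scenario: str, score: int) -> str:
    """Return the README text with one result row appended to the results table."""
    row = f"| {date} | {model} | {scenario} | {score} |"
    lines = readme.rstrip("\n").split("\n")
    if HEADER not in lines:
        # No table yet: start one at the end of the file.
        return "\n".join(lines + ["", HEADER, DIVIDER, row]) + "\n"
    # Walk past the header, divider, and existing rows, then insert below them.
    idx = lines.index(HEADER) + 1
    while idx < len(lines) and lines[idx].startswith("|"):
        idx += 1
    lines.insert(idx, row)
    return "\n".join(lines) + "\n"
```

Because earlier rows are never rewritten, each run adds to the history instead of replacing it.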

Which models can I evaluate with this skill?

The skill supports parallel evaluation across the Claude family, specifically targeting Claude Haiku, Sonnet, and Opus to compare performance, quality, and cost-efficiency.

How does the quality-based scoring work?

Instead of a simple pass/fail, it uses a 0-100 scale. For each expected behavior, it checks whether the minimum criteria are met, compares the output against specific quality benchmarks, and deducts points for model-specific pitfalls. Each behavior carries a weight that determines how much it contributes to the final score.
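
The scoring steps can be sketched as a weighted average. This is only an illustration under stated assumptions: the field names, the rule that a failed minimum zeroes a behavior, and the exact formula are all hypothetical, not the skill's documented rubric.

```python
from dataclasses import dataclass


@dataclass
class BehaviorResult:
    """Outcome of judging one expected behavior (all fields are hypothetical)."""
    name: str
    weight: float                 # relative importance of this behavior
    meets_minimum: bool           # were the minimum criteria satisfied?
    quality: float                # 0.0-1.0 share of quality benchmarks met
    pitfall_penalty: float = 0.0  # points deducted for model-specific pitfalls


def behavior_score(result: BehaviorResult) -> float:
    """Score one behavior on a 0-100 scale."""
    if not result.meets_minimum:
        return 0.0  # assumption: failing the minimum criteria zeroes the behavior
    raw = 100.0 * result.quality - result.pitfall_penalty
    return max(0.0, min(100.0, raw))  # clamp to the 0-100 range


def overall_score(results: list[BehaviorResult]) -> float:
    """Weighted average of per-behavior scores, still on a 0-100 scale."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("at least one behavior needs a positive weight")
    return sum(r.weight * behavior_score(r) for r in results) / total_weight
```

For example, a behavior with weight 2 that meets 75% of its benchmarks and loses 15 points to a pitfall scores 60; combined with a perfect weight-1 behavior and a weight-1 behavior that fails its minimum criteria, the overall score is (2×60 + 100 + 0) / 4 = 55.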

Is this skill available on Claude.ai?

No, this skill is designed exclusively for the Claude Code CLI, as it requires the ability to spawn sub-agents and interact with the local file system.

What is the benefit of testing on 'Hard' scenarios?

Easy scenarios often don't reveal meaningful differences between models. Prioritizing Hard or Medium scenarios makes it more likely that you will see where a smaller model such as Haiku struggles compared to Opus.
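
One simple way to express this prioritization is to sort scenarios by difficulty before picking which ones to run. The scenario dictionaries, the `difficulty` key, and the `prioritize` helper below are hypothetical; the skill's real scenario format may differ.

```python
# Lower rank means "run first". Labels mirror the Hard/Medium/Easy tiers above.
DIFFICULTY_RANK = {"hard": 0, "medium": 1, "easy": 2}


def prioritize(scenarios: list[dict], limit: int) -> list[dict]:
    """Return up to `limit` scenarios, hardest first.

    sorted() is stable, so scenarios of equal difficulty keep their original order.
    """
    ranked = sorted(scenarios, key=lambda s: DIFFICULTY_RANK[s["difficulty"]])
    return ranked[:limit]
```

With a limit of 2 and one scenario per tier, this selects the Hard and Medium scenarios and skips the Easy one.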

Evaluating Skills with Models

Name: Evaluating Skills with Models
Author: taisukeoe


Security and Testing

Evaluates AI skill performance across multiple Claude models using parallel sub-agents and quality-based scoring.

This skill automates comparative testing of agentic AI skills by running them across the Claude model family: Sonnet, Opus, and Haiku. Instead of binary pass/fail metrics, it uses weighted, quality-based scoring and flags model-specific pitfalls. Scenarios run in parallel via sub-agents, which helps developers judge production readiness, find the most cost-effective model that still handles a given task, and record historical performance directly in the repository's README.

Key Features

01 Automated README documentation with historical performance tables

02 Difficulty-based scenario prioritization for realistic testing

03 0 GitHub stars

04 Parallel execution across Claude Sonnet, Opus, and Haiku models

05 Quality-based weighted scoring system (0-100 scale)

06 Detection of model-specific pitfalls like over-engineering or shallow reasoning

Use Cases

01 Regression testing skills during development to ensure consistent behavior across model updates

02 Determining if a newly developed AI skill is ready for production deployment

03 Comparing model performance to find the cheapest model that meets quality requirements


Install with 🐟 Skill.Fish

npx skillfish add taisukeoe/agentic-ai-skills-creator evaluating-skills-with-models
