Evaluating Skills with Models FAQs

Question 1

Where are the evaluation results stored?

Accepted Answer

Results are output to the console and automatically appended to the target skill's README.md in a standardized table, creating a permanent historical record of performance.
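As a rough illustration of the append-to-README behavior, here is a minimal Python sketch. The column names, the `append_result` helper, and the assumption that the results table is the last thing in the file are all hypothetical; the skill's actual table layout is not specified here.

```python
from pathlib import Path

# Hypothetical column layout; the skill's real table may use different columns.
HEADER = "| Date | Model | Scenario | Score |\n|---|---|---|---|\n"


def append_result(readme: Path, date: str, model: str, scenario: str, score: int) -> None:
    """Append one result row, writing the table header first if it is missing.

    Assumes the results table sits at the end of the file.
    """
    text = readme.read_text(encoding="utf-8") if readme.exists() else ""
    if text and not text.endswith("\n"):
        text += "\n"  # make sure the new content starts on its own line
    if HEADER not in text:
        # Separate the table from existing content with a blank line.
        text += ("\n" if text else "") + HEADER
    text += f"| {date} | {model} | {scenario} | {score} |\n"
    readme.write_text(text, encoding="utf-8")
```

Because earlier rows are never rewritten, each run adds to the history instead of replacing it.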

Question 2

Which models can I evaluate with this skill?

Accepted Answer

The skill supports parallel evaluation across the Claude family, specifically targeting Claude Haiku, Sonnet, and Opus to compare performance, quality, and cost-efficiency.
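The parallel fan-out can be pictured with a short Python sketch. This is not how the skill itself runs (it spawns sub-agents inside Claude Code); `evaluate` is a hypothetical stub, and the entries in `MODELS` are tier names, not real model identifiers.

```python
from concurrent.futures import ThreadPoolExecutor

# Tier names taken from the answer above; real model IDs will differ.
MODELS = ["haiku", "sonnet", "opus"]


def evaluate(model: str, scenario: str) -> dict:
    """Hypothetical stand-in for running one scenario against one model."""
    # A real run would invoke the model here; this stub only echoes its inputs.
    return {"model": model, "scenario": scenario, "score": None}


def evaluate_all(scenario: str, run=evaluate) -> list:
    """Run the same scenario against every model concurrently.

    Results come back in the same order as MODELS.
    """
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        return list(pool.map(lambda m: run(m, scenario), MODELS))
```

Running every model on the same scenario keeps the comparison like-for-like, which is what makes quality and cost differences readable side by side.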

Question 3

How does the quality-based scoring work?

Accepted Answer

Instead of a simple pass/fail result, the skill scores each evaluation on a 0-100 scale. It checks whether the minimum criteria are met, compares the output against specific quality benchmarks, deducts points for model-specific pitfalls, and applies a weight to each behavior.
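One way to picture how these four ingredients combine is the Python sketch below. The 50/50 split between minimum criteria and quality benchmarks, the field names, and the weighted-average aggregation are all assumptions made for illustration; the skill's real rubric may differ.

```python
from dataclasses import dataclass


@dataclass
class BehaviorResult:
    weight: float                 # relative importance of this behavior
    meets_minimum: bool           # were the minimum criteria satisfied?
    quality_hits: int             # quality benchmarks satisfied
    quality_total: int            # quality benchmarks checked
    pitfall_penalty: float = 0.0  # points deducted for model-specific pitfalls


# Assumed split: 50 points for meeting the minimum, up to 50 more for quality.
MINIMUM_POINTS = 50.0


def behavior_score(b: BehaviorResult) -> float:
    """Score one behavior on a 0-100 scale."""
    if not b.meets_minimum:
        return 0.0
    quality = 0.0  # stays 0 when no benchmarks are defined
    if b.quality_total:
        quality = (100.0 - MINIMUM_POINTS) * b.quality_hits / b.quality_total
    # Clamp so penalties cannot push the score outside 0-100.
    return max(0.0, min(100.0, MINIMUM_POINTS + quality - b.pitfall_penalty))


def overall_score(behaviors: list) -> float:
    """Weighted average of the per-behavior scores."""
    total_weight = sum(b.weight for b in behaviors)
    if total_weight == 0:
        return 0.0
    return sum(b.weight * behavior_score(b) for b in behaviors) / total_weight
```

Under these assumptions, a behavior that meets the minimum, passes 3 of 4 benchmarks, and loses 5 points to a pitfall scores 50 + 37.5 - 5 = 82.5 before weighting.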

Question 4

Is this skill available on Claude.ai?

Accepted Answer

No. This skill is designed exclusively for the Claude Code CLI because it needs to spawn sub-agents and interact with the local file system.

Question 5

What is the benefit of testing on 'Hard' scenarios?

Accepted Answer

Easy scenarios often don't reveal meaningful differences between models. Prioritizing Hard or Medium scenarios shows where models like Haiku might struggle compared to Opus.
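If scenarios carry a difficulty label, the prioritization could look like the Python sketch below. The `difficulty` field, its three values, and the `prioritize` helper are hypothetical names chosen for illustration.

```python
# Lower rank means the scenario is picked first.
DIFFICULTY_RANK = {"hard": 0, "medium": 1, "easy": 2}


def prioritize(scenarios: list, limit: int) -> list:
    """Return up to `limit` scenarios, hardest first.

    sorted() is stable, so scenarios of equal difficulty keep their original order.
    """
    ranked = sorted(scenarios, key=lambda s: DIFFICULTY_RANK[s["difficulty"]])
    return ranked[:limit]
```

With a small evaluation budget, this keeps Easy scenarios out of the run unless there are not enough Hard and Medium ones to fill it.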

Evaluating Skills with Models

Key Features

Use Cases
