What is the benefit of the scorecard output?

The scorecard provides an immediate visual audit of skill health, checking description length, naming conventions, examples, and line counts.
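
As a rough illustration of the four checks named above, a scorecard could be computed along these lines. This is a minimal sketch, not the skill's actual implementation: the function name `scorecard`, the limits of 1024 characters and 500 lines, the lowercase-hyphenated name pattern, and the keyword test for examples are all assumptions chosen for the example.

```python
import re

# Illustrative limits only; these are assumptions, not thresholds
# confirmed by the Skill Quality & Benchmarker itself.
MAX_DESCRIPTION_CHARS = 1024
MAX_BODY_LINES = 500
# Assumed naming convention: lowercase words and digits joined by hyphens.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def scorecard(name: str, description: str, body: str) -> dict[str, bool]:
    """Return a pass/fail flag for each static check (hypothetical helper)."""
    return {
        "name_convention": bool(NAME_PATTERN.match(name)),
        "description_length": 0 < len(description) <= MAX_DESCRIPTION_CHARS,
        # Crude heuristic: the body mentions the word "example" somewhere.
        "has_examples": "example" in body.lower(),
        "line_count": len(body.splitlines()) <= MAX_BODY_LINES,
    }


if __name__ == "__main__":
    card = scorecard(
        "claude-skills-benchmark",
        "Evaluates and benchmarks Agent Skills.",
        "## Example\nRun the benchmark on one skill.",
    )
    for check, passed in card.items():
        print(f"{check}: {'PASS' if passed else 'FAIL'}")
```

Returning one boolean per check keeps the result easy to render as a pass/fail table, which is the spirit of an "immediate visual audit".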

Which models are supported for testing?

It provides specific pass-rate benchmarks for Claude 3 Haiku (70%+), Sonnet (85%+), and Opus (95%+) to ensure consistent cross-model performance.
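
The targets above can be checked mechanically once a test run has produced pass and total counts per model. The sketch below assumes a hypothetical helper `meets_target` and short model keys (`haiku`, `sonnet`, `opus`); only the percentages come from the answer above.

```python
# Pass-rate targets from the listing, in whole percent.
# The dictionary keys are hypothetical labels for this example.
PASS_RATE_TARGETS = {"haiku": 70, "sonnet": 85, "opus": 95}


def meets_target(model: str, passed: int, total: int) -> bool:
    """Return True if passed/total reaches the target for the given model."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    # Integer comparison avoids floating-point rounding at the boundary.
    return passed * 100 >= PASS_RATE_TARGETS[model] * total


if __name__ == "__main__":
    results = {"haiku": (15, 20), "sonnet": (17, 20), "opus": (18, 20)}
    for model, (passed, total) in results.items():
        status = "meets" if meets_target(model, passed, total) else "misses"
        print(f"{model}: {passed}/{total} {status} the {PASS_RATE_TARGETS[model]}% target")
```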

How does it prevent AI hallucinations?

It enforces anti-fabrication rules that require skills to base outputs on actual tool execution and avoid using unsubstantiated metrics or superlatives.

How does it measure skill activation accuracy?

It calculates true positive and false positive rates by testing skills against a mix of representative prompts and out-of-scope triggers.
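
The calculation can be sketched as follows, assuming each test prompt is labeled with whether the skill should activate and whether it actually did. The `Trial` record and `activation_rates` function are hypothetical names for this example; the skill's own data format is not described here.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    prompt: str
    should_activate: bool  # True for representative prompts, False for out-of-scope ones
    did_activate: bool     # what was observed when the prompt was run


def activation_rates(trials: list[Trial]) -> tuple[float, float]:
    """Return (true positive rate, false positive rate) for a set of trials."""
    in_scope = [t for t in trials if t.should_activate]
    out_of_scope = [t for t in trials if not t.should_activate]
    true_positives = sum(t.did_activate for t in in_scope)
    false_positives = sum(t.did_activate for t in out_of_scope)
    # Report 0.0 when a group is empty rather than dividing by zero.
    tpr = true_positives / len(in_scope) if in_scope else 0.0
    fpr = false_positives / len(out_of_scope) if out_of_scope else 0.0
    return tpr, fpr


if __name__ == "__main__":
    trials = [
        Trial("benchmark this skill", True, True),
        Trial("score my SKILL.md", True, False),
        Trial("write a poem", False, False),
        Trial("fix my CSS", False, True),
    ]
    tpr, fpr = activation_rates(trials)
    print(f"true positive rate: {tpr:.2f}, false positive rate: {fpr:.2f}")
```

A high true positive rate with a low false positive rate suggests the skill description triggers on the right prompts without firing on unrelated ones.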

What does the Claude Skill Benchmarker do?

It evaluates the quality of AI skills through static checks and performance benchmarking to ensure they meet Anthropic's standards for accuracy and efficiency.

Skill Quality & Benchmarker

Name: Skill Quality & Benchmarker
Author: vinnie357

Security & Testing

Evaluates and benchmarks the quality of Claude Agent Skills using static analysis and performance-driven evaluation methodologies.

This skill provides a framework for assessing the effectiveness and reliability of Agent Skills within the Claude Code ecosystem. It combines static analysis checks (naming conventions, description length, and anti-fabrication rules) with an evaluation methodology that includes A/B testing and multi-model pass-rate targets. By generating automated scorecards and measuring activation accuracy, it helps developers optimize skills for high performance, minimal context waste, and consistent behavior across the Haiku, Sonnet, and Opus model tiers.

Key Features

01 Skill activation accuracy tracking (true positive and false positive rates)

02 11 GitHub stars

03 A/B testing for comparing skill versions and workflows

04 Static analysis with automated quality scorecards

05 Multi-model performance benchmarking for Haiku, Sonnet, and Opus

06 Anti-fabrication and tool-validation enforcement

Use Cases

01 Optimizing skill descriptions to reduce false negatives and context waste

02 Validating new Agent Skills against Anthropic's best practices

03 Measuring the impact of model updates on existing skill performance

Install with 🐟 Skill.Fish

npx skillfish add vinnie357/claude-skills claude-skills-benchmark

For use in Claude.ai and ChatGPT
