Introduction
The AI Model Evaluation Benchmark skill provides a standardized framework for comparing large language models within agentic coding workflows. By automating the Benchmark Suite V3 reference implementation, it runs multi-phase tests that measure model efficiency (cost, turns, duration), code generation quality as judged by reviewer agents, and compliance with complex multi-step workflows. The skill manages the entire lifecycle, from environment setup and parallel task execution through automated markdown report generation to mandatory cleanup of GitHub artifacts, yielding reproducible, data-driven insights for model selection and optimization.
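
As a rough illustration of the kind of data the skill collects and reports, the sketch below models per-run metrics, renders them as a markdown comparison table, and guarantees a cleanup step in a `finally` block. All names here (`ModelRun`, `render_report`, `cleanup_github_artifacts`) are hypothetical and do not reflect the actual Benchmark Suite V3 interfaces.

```python
"""Minimal sketch of benchmark result collection and reporting.

All identifiers are illustrative assumptions, not the real suite's API.
"""
from dataclasses import dataclass
from typing import List


@dataclass
class ModelRun:
    """Metrics gathered for one model across all benchmark phases."""
    model: str
    cost_usd: float           # total API spend for the run
    turns: int                # agent turns consumed
    duration_s: float         # wall-clock time in seconds
    review_score: float       # 0-10 score assigned by reviewer agents
    workflow_compliant: bool  # passed the multi-step workflow checks


def render_report(runs: List[ModelRun]) -> str:
    """Render per-model results as a markdown comparison table."""
    header = (
        "| Model | Cost (USD) | Turns | Duration (s) | Review | Workflow |\n"
        "|---|---|---|---|---|---|"
    )
    rows = [
        f"| {r.model} | {r.cost_usd:.2f} | {r.turns} | {r.duration_s:.0f} "
        f"| {r.review_score:.1f} | {'pass' if r.workflow_compliant else 'fail'} |"
        for r in runs
    ]
    return "\n".join([header, *rows])


def cleanup_github_artifacts(runs: List[ModelRun]) -> None:
    """Placeholder for the mandatory cleanup of branches, PRs, and issues."""
    for r in runs:
        print(f"cleaning up GitHub artifacts for {r.model}")


if __name__ == "__main__":
    runs = [
        ModelRun("model-a", 1.42, 18, 312.0, 8.5, True),
        ModelRun("model-b", 0.97, 25, 401.0, 7.0, False),
    ]
    try:
        print(render_report(runs))
    finally:
        # Cleanup runs even if report generation fails.
        cleanup_github_artifacts(runs)
```

The `try`/`finally` structure mirrors the lifecycle described above: report generation may fail, but cleanup of GitHub artifacts is always attempted.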