About
The Model Evaluation Benchmark skill provides an automated framework for reproducing the Benchmark Suite V3 reference implementation. It lets developers run side-by-side comparisons of AI models, such as Claude 3.5 Sonnet and Opus, by measuring key performance indicators: token cost, execution duration, and tool usage. The skill automates the entire benchmarking lifecycle, from task execution and reviewer-agent analysis through report generation and environment cleanup, so that evaluations of agentic workflows remain reproducible, data-driven, and resource-efficient.
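
To make the side-by-side comparison concrete, here is a minimal sketch of the kind of per-model metrics and comparison report the skill produces. The names (`RunResult`, `compare_runs`) and the example figures are hypothetical illustrations under stated assumptions, not the skill's actual API or measured results.

```python
# A minimal sketch of the side-by-side comparison this skill automates.
# RunResult, compare_runs, and the metric fields are hypothetical
# illustrations, not the skill's actual API.
from dataclasses import dataclass


@dataclass
class RunResult:
    """Metrics captured for one model's pass over the benchmark tasks."""
    model: str
    input_tokens: int
    output_tokens: int
    duration_s: float
    tool_calls: int


def compare_runs(a: RunResult, b: RunResult) -> str:
    """Render a side-by-side report of the key performance indicators."""
    rows = [
        ("input tokens", a.input_tokens, b.input_tokens),
        ("output tokens", a.output_tokens, b.output_tokens),
        ("duration (s)", a.duration_s, b.duration_s),
        ("tool calls", a.tool_calls, b.tool_calls),
    ]
    header = f"{'metric':<16}{a.model:>24}{b.model:>24}"
    lines = [header, "-" * len(header)]
    for name, va, vb in rows:
        lines.append(f"{name:<16}{va:>24}{vb:>24}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example numbers are placeholders, not measured benchmark data.
    sonnet = RunResult("claude-3-5-sonnet", 48_210, 9_830, 312.4, 41)
    opus = RunResult("claude-opus", 47_960, 11_120, 498.7, 57)
    print(compare_runs(sonnet, opus))
```

In an actual run, the skill would populate these per-model metrics during task execution and hand them to the reviewer-agent and report-generation stages rather than printing them directly.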