Introduction
This skill provides a standardized framework for running the Benchmark Suite V3 reference implementation within Claude Code, enabling developers to quantitatively compare AI models. It orchestrates a multi-phase workflow: automated setup, execution of complex tasks, and reviewer-agent analysis of code quality and tool usage. By measuring metrics such as duration, cost, and workflow compliance, the skill produces detailed markdown reports and GitHub artifacts that teams can use to validate agentic AI behavior and enforce performance standards in automated development pipelines.
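
As a rough illustration of the kind of data such reports capture, the sketch below models per-phase metrics (duration, cost, workflow compliance) and renders them as a markdown table. All names here (`PhaseResult`, `BenchmarkRun`, the phase labels) are hypothetical examples, not the actual API of this skill or of Benchmark Suite V3.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhaseResult:
    """Metrics recorded for one phase of a benchmark run (hypothetical structure)."""
    name: str          # e.g. "setup", "task-execution", "review"
    duration_s: float  # wall-clock time in seconds
    cost_usd: float    # estimated API cost for the phase
    compliant: bool    # whether the phase followed the expected workflow


@dataclass
class BenchmarkRun:
    """A single model's pass through the benchmark suite (hypothetical structure)."""
    model: str
    phases: List[PhaseResult] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the run as a markdown report with a per-phase table and totals."""
        lines = [
            f"# Benchmark report: {self.model}",
            "",
            "| Phase | Duration (s) | Cost (USD) | Workflow compliant |",
            "|-------|--------------|------------|--------------------|",
        ]
        for p in self.phases:
            lines.append(
                f"| {p.name} | {p.duration_s:.1f} | {p.cost_usd:.4f} | "
                f"{'yes' if p.compliant else 'no'} |"
            )
        total_time = sum(p.duration_s for p in self.phases)
        total_cost = sum(p.cost_usd for p in self.phases)
        lines += ["", f"**Total:** {total_time:.1f} s, ${total_cost:.4f}"]
        return "\n".join(lines)


if __name__ == "__main__":
    # Example values only; real runs would populate these from measured data.
    run = BenchmarkRun(model="example-model")
    run.phases.append(PhaseResult("setup", 42.0, 0.0150, True))
    run.phases.append(PhaseResult("task-execution", 310.5, 0.8200, True))
    run.phases.append(PhaseResult("review", 95.2, 0.2100, False))
    print(run.to_markdown())
```

A structure along these lines keeps per-phase results comparable across models, which is what makes side-by-side reports and GitHub artifacts straightforward to generate.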