What does the Model Evaluation Benchmark skill do?

It automates the reproduction of standardized benchmarks to compare AI model performance, measuring efficiency, code quality, and workflow adherence.

Does it handle cleanup after running benchmarks?

Yes, the skill includes a mandatory cleanup phase that closes benchmark PRs/issues and removes git worktrees to maintain repository hygiene.
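A minimal sketch of what such a cleanup phase could look like, assuming it relies on the GitHub CLI (`gh`) and `git`. The helper name `build_cleanup_commands` and its parameters are hypothetical and not taken from the skill. The sketch only builds the command lists so they can be inspected before anything is run.

```python
from typing import List


def build_cleanup_commands(
    pr_numbers: List[int],
    issue_numbers: List[int],
    worktree_paths: List[str],
) -> List[List[str]]:
    """Return the commands a cleanup pass could run (hypothetical helper)."""
    commands: List[List[str]] = []
    # Close benchmark pull requests with the GitHub CLI.
    for pr in pr_numbers:
        commands.append(["gh", "pr", "close", str(pr)])
    # Close benchmark issues with the GitHub CLI.
    for issue in issue_numbers:
        commands.append(["gh", "issue", "close", str(issue)])
    # Remove the git worktrees created for the benchmark runs.
    for path in worktree_paths:
        commands.append(["git", "worktree", "remove", "--force", path])
    return commands
```

Each returned list could then be passed to `subprocess.run(cmd, check=True)` once the plan has been reviewed.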

Which models can I compare with this skill?

While it defaults to comparing Claude Opus and Sonnet, it can be configured to benchmark any models supported by the underlying Benchmark Suite V3 framework.

What metrics are included in the benchmark report?

Reports include efficiency data (duration, turns, cost), quality scores from reviewer agents, and workflow compliance metrics like subagent usage.
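As an illustration of how these metrics might be combined into a markdown comparison table, here is a small sketch. The `BenchmarkResult` fields and the `render_report` function are assumptions made for this example and do not reflect the skill's actual report schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BenchmarkResult:
    # Field names are illustrative assumptions, not the skill's schema.
    model: str
    duration_s: float
    turns: int
    cost_usd: float
    quality_score: float  # 1-5 score from reviewer agents
    subagent_calls: int   # one possible workflow compliance metric


def render_report(results: List[BenchmarkResult]) -> str:
    """Render one markdown table row per benchmarked model."""
    header = "| Model | Duration (s) | Turns | Cost (USD) | Quality (1-5) | Subagent calls |"
    divider = "|---|---|---|---|---|---|"
    rows = [
        f"| {r.model} | {r.duration_s:.1f} | {r.turns} | {r.cost_usd:.2f} "
        f"| {r.quality_score:.1f} | {r.subagent_calls} |"
        for r in results
    ]
    return "\n".join([header, divider, *rows])
```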

How is code quality measured in this benchmark?

The skill launches parallel reviewer subagents to analyze trace logs and score the generated code on a standardized 1-5 scale.
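The parallel scoring step can be sketched roughly as follows. In the skill the reviewers are subagents; here they are stand-in Python callables, and averaging the scores is an assumption of this example rather than a documented aggregation rule. The names `Reviewer` and `score_trace` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, List

# Hypothetical reviewer type: takes a trace log, returns a score from 1 to 5.
Reviewer = Callable[[str], int]


def score_trace(trace_log: str, reviewers: List[Reviewer]) -> float:
    """Run all reviewers concurrently and return their mean score."""
    if not reviewers:
        raise ValueError("at least one reviewer is required")
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        scores = list(pool.map(lambda review: review(trace_log), reviewers))
    # Reject anything outside the 1-5 scale before aggregating.
    for score in scores:
        if not 1 <= score <= 5:
            raise ValueError(f"score {score} is outside the 1-5 scale")
    return float(mean(scores))
```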

Model Evaluation Benchmark

Name: Model Evaluation Benchmark
Author: rysweet


Data Science & ML

Automates comprehensive AI model benchmarking and performance comparison using the Benchmark Suite V3 framework.

This skill enables developers to systematically evaluate and compare AI models within agentic workflows by orchestrating end-to-end benchmarks. It measures critical metrics such as execution efficiency, code quality through reviewer agents, and workflow adherence while automating the entire lifecycle from setup and execution to reporting and mandatory cleanup. It is particularly useful for teams needing objective data to decide between models like Claude 3.5 Sonnet and Claude 3 Opus for specific production coding tasks.

Key Features

01. 16 GitHub stars

02. Automated multi-model performance comparison and execution

03. Comprehensive markdown reporting with GitHub integration for issues and releases

04. Automated code quality scoring using specialized reviewer subagents

05. Mandatory cleanup protocols to manage PRs, worktrees, and temporary artifacts

06. Deep analysis of efficiency metrics including cost, tool calls, and turn counts

Use Cases

01. Comparing different LLM versions for specific agentic coding workflows

02. Measuring the impact of framework updates on model performance and tool usage

03. Generating reproducible performance reports for AI-driven software development


Install with 🐟 Skill.Fish

npx skillfish add rysweet/amplihack model-evaluation-benchmark

For use in Claude.ai and ChatGPT
