Introduction
This skill provides a standardized framework for running the Benchmark Suite V3 reference implementation within Claude Code, enabling developers to quantitatively compare AI models. It orchestrates a multi-phase workflow: automated setup, execution of complex tasks, and reviewer-agent analysis of code quality and tool usage. By measuring metrics such as duration, cost, and workflow compliance, the skill produces detailed markdown reports and GitHub artifacts that teams can use to validate agentic AI behavior and enforce performance standards in automated development pipelines.
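
As a rough illustration of the kind of data such reports capture, the sketch below models per-phase metrics (duration, cost, workflow compliance) and renders them as a markdown table. All names here (`PhaseResult`, `BenchmarkRun`, the phase labels) are hypothetical examples, not the actual API of this skill or of Benchmark Suite V3.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhaseResult:
    """Metrics recorded for one phase of a benchmark run (hypothetical structure)."""
    name: str          # e.g. "setup", "task-execution", "review"
    duration_s: float  # wall-clock time in seconds
    cost_usd: float    # estimated API cost for the phase
    compliant: bool    # whether the phase followed the expected workflow


@dataclass
class BenchmarkRun:
    """A single model's pass through the benchmark suite (hypothetical structure)."""
    model: str
    phases: List[PhaseResult] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the run as a markdown report with a per-phase table and totals."""
        lines = [
            f"# Benchmark report: {self.model}",
            "",
            "| Phase | Duration (s) | Cost (USD) | Workflow compliant |",
            "|-------|--------------|------------|--------------------|",
        ]
        for p in self.phases:
            lines.append(
                f"| {p.name} | {p.duration_s:.1f} | {p.cost_usd:.4f} | "
                f"{'yes' if p.compliant else 'no'} |"
            )
        total_time = sum(p.duration_s for p in self.phases)
        total_cost = sum(p.cost_usd for p in self.phases)
        lines += ["", f"**Total:** {total_time:.1f} s, ${total_cost:.4f}"]
        return "\n".join(lines)


if __name__ == "__main__":
    # Example values only; real runs would populate these from measured data.
    run = BenchmarkRun(model="example-model")
    run.phases.append(PhaseResult("setup", 42.0, 0.0150, True))
    run.phases.append(PhaseResult("task-execution", 310.5, 0.8200, True))
    run.phases.append(PhaseResult("review", 95.2, 0.2100, False))
    print(run.to_markdown())
```

A structure along these lines keeps per-phase results comparable across models, which is what makes side-by-side reports and GitHub artifacts straightforward to generate.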