Introduction
The AI Model Evaluation Benchmark skill provides a standardized framework for comparing large language models within agentic coding workflows. By automating the Benchmark Suite V3 reference implementation, it runs multi-phase tests that measure model efficiency (cost, turns, duration), code generation quality as judged by reviewer agents, and compliance with complex multi-step workflows. The skill manages the entire lifecycle, from environment setup and parallel task execution through automated markdown report generation to mandatory cleanup of GitHub artifacts, yielding reproducible, data-driven insights for model selection and optimization.
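
As a rough illustration of the kind of data the skill collects and reports, the sketch below models per-run metrics, renders them as a markdown comparison table, and guarantees a cleanup step in a `finally` block. All names here (`ModelRun`, `render_report`, `cleanup_github_artifacts`) are hypothetical and do not reflect the actual Benchmark Suite V3 interfaces.

```python
"""Minimal sketch of benchmark result collection and reporting.

All identifiers are illustrative assumptions, not the real suite's API.
"""
from dataclasses import dataclass
from typing import List


@dataclass
class ModelRun:
    """Metrics gathered for one model across all benchmark phases."""
    model: str
    cost_usd: float           # total API spend for the run
    turns: int                # agent turns consumed
    duration_s: float         # wall-clock time in seconds
    review_score: float       # 0-10 score assigned by reviewer agents
    workflow_compliant: bool  # passed the multi-step workflow checks


def render_report(runs: List[ModelRun]) -> str:
    """Render per-model results as a markdown comparison table."""
    header = (
        "| Model | Cost (USD) | Turns | Duration (s) | Review | Workflow |\n"
        "|---|---|---|---|---|---|"
    )
    rows = [
        f"| {r.model} | {r.cost_usd:.2f} | {r.turns} | {r.duration_s:.0f} "
        f"| {r.review_score:.1f} | {'pass' if r.workflow_compliant else 'fail'} |"
        for r in runs
    ]
    return "\n".join([header, *rows])


def cleanup_github_artifacts(runs: List[ModelRun]) -> None:
    """Placeholder for the mandatory cleanup of branches, PRs, and issues."""
    for r in runs:
        print(f"cleaning up GitHub artifacts for {r.model}")


if __name__ == "__main__":
    runs = [
        ModelRun("model-a", 1.42, 18, 312.0, 8.5, True),
        ModelRun("model-b", 0.97, 25, 401.0, 7.0, False),
    ]
    try:
        print(render_report(runs))
    finally:
        # Cleanup runs even if report generation fails.
        cleanup_github_artifacts(runs)
```

The `try`/`finally` structure mirrors the lifecycle described above: report generation may fail, but cleanup of GitHub artifacts is always attempted.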