About
The Model Evaluation Benchmark skill provides an automated framework for reproducing the Benchmark Suite V3 reference implementation. It lets developers run side-by-side comparisons of AI models, such as Claude 3.5 Sonnet and Opus, by measuring key performance indicators: token cost, execution duration, and tool usage. The skill automates the entire benchmarking lifecycle, from task execution and reviewer-agent analysis through report generation and environment cleanup, so that evaluations of agentic workflows remain reproducible, data-driven, and resource-efficient.
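
To make the side-by-side comparison concrete, here is a minimal sketch of the kind of per-model metrics and comparison report the skill produces. The names (`RunResult`, `compare_runs`) and the example figures are hypothetical illustrations under stated assumptions, not the skill's actual API or measured results.

```python
# A minimal sketch of the side-by-side comparison this skill automates.
# RunResult, compare_runs, and the metric fields are hypothetical
# illustrations, not the skill's actual API.
from dataclasses import dataclass


@dataclass
class RunResult:
    """Metrics captured for one model's pass over the benchmark tasks."""
    model: str
    input_tokens: int
    output_tokens: int
    duration_s: float
    tool_calls: int


def compare_runs(a: RunResult, b: RunResult) -> str:
    """Render a side-by-side report of the key performance indicators."""
    rows = [
        ("input tokens", a.input_tokens, b.input_tokens),
        ("output tokens", a.output_tokens, b.output_tokens),
        ("duration (s)", a.duration_s, b.duration_s),
        ("tool calls", a.tool_calls, b.tool_calls),
    ]
    header = f"{'metric':<16}{a.model:>24}{b.model:>24}"
    lines = [header, "-" * len(header)]
    for name, va, vb in rows:
        lines.append(f"{name:<16}{va:>24}{vb:>24}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example numbers are placeholders, not measured benchmark data.
    sonnet = RunResult("claude-3-5-sonnet", 48_210, 9_830, 312.4, 41)
    opus = RunResult("claude-opus", 47_960, 11_120, 498.7, 57)
    print(compare_runs(sonnet, opus))
```

In an actual run, the skill would populate these per-model metrics during task execution and hand them to the reviewer-agent and report-generation stages rather than printing them directly.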