About
This skill lets developers systematically evaluate and compare AI models within agentic workflows by orchestrating end-to-end benchmarks. It measures key metrics, including execution efficiency, code quality as scored by reviewer agents, and workflow adherence, and automates the entire lifecycle from setup and execution through reporting and mandatory cleanup. It is particularly useful for teams that need objective data when choosing between models such as Claude 3.5 Sonnet and Claude 3 Opus for specific production coding tasks.
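
Below is a minimal sketch of the lifecycle the skill automates, assuming a Python harness. Every name in it (`setup_workspace`, `run_task`, `review_code`, `teardown_workspace`, `BenchmarkResult`) is a hypothetical stand-in, not the skill's actual API; the point is the shape of the loop, where `try`/`finally` guarantees the mandatory cleanup runs even when a task fails.

```python
import time
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    model: str
    wall_time_s: float     # execution efficiency
    review_score: float    # code quality, scored by a reviewer agent
    adherence: float       # fraction of prescribed workflow steps followed


# The four functions below are stubs standing in for the skill's real harness.
def setup_workspace(task: str) -> str:
    """Provision an isolated sandbox for one benchmark run (stub)."""
    return f"/tmp/bench-{task}"


def run_task(model: str, task: str, workspace: str) -> list[str]:
    """Drive the model through the agentic task; return its step transcript (stub)."""
    return ["plan", "edit", "test"]


def review_code(workspace: str) -> float:
    """Have a reviewer agent score the produced code from 0 to 1 (stub)."""
    return 0.9


def teardown_workspace(workspace: str) -> None:
    """Remove all run artifacts; must execute even when the task fails (stub)."""


def run_benchmark(models: list[str], task: str,
                  expected_steps: list[str]) -> list[BenchmarkResult]:
    results = []
    for model in models:
        workspace = setup_workspace(task)
        try:
            start = time.monotonic()
            transcript = run_task(model, task, workspace)
            elapsed = time.monotonic() - start
            followed = sum(step in transcript for step in expected_steps)
            results.append(BenchmarkResult(
                model=model,
                wall_time_s=elapsed,
                review_score=review_code(workspace),
                adherence=followed / len(expected_steps),
            ))
        finally:
            teardown_workspace(workspace)  # mandatory cleanup, even on failure
    return results


if __name__ == "__main__":
    for result in run_benchmark(["claude-3-5-sonnet", "claude-3-opus"],
                                "refactor-auth-module",
                                expected_steps=["plan", "edit", "test"]):
        print(result)
```

Running one model at a time in a fresh workspace keeps the comparison fair: each candidate sees an identical starting state, and cleanup in `finally` prevents one run's artifacts from leaking into the next.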