Overview
The Evaluation skill provides a systematic approach to assessing complex agent systems where traditional software testing often falls short. It addresses the unique challenges of non-determinism and context-dependent failures by offering outcome-focused methodologies, including multi-dimensional scoring rubrics and LLM-as-judge automation. By scoring factors such as factual accuracy, tool efficiency, and complexity stratification, this skill enables developers to build quality gates, validate context engineering strategies, and implement continuous evaluation pipelines. The result is that agents maintain high standards of reliability and efficiency throughout their lifecycle.
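The multi-dimensional rubric with LLM-as-judge scoring described above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the dimension names and weights are assumptions drawn from the factors listed in the text, and `judge_dimension` is a stub standing in for a real LLM-as-judge call.

```python
# Hypothetical rubric dimensions and weights, based on the factors the
# text names (factual accuracy, tool efficiency); illustrative only.
RUBRIC = {
    "factual_accuracy": 0.5,
    "tool_efficiency": 0.3,
    "task_completion": 0.2,
}

def judge_dimension(transcript: str, dimension: str) -> float:
    """Stand-in for an LLM-as-judge call. In a real pipeline this would
    prompt a judge model to score the transcript from 0.0 to 1.0 on one
    rubric dimension; here a toy keyword check keeps the sketch runnable."""
    keywords = {
        "factual_accuracy": "correct",
        "tool_efficiency": "single tool call",
        "task_completion": "done",
    }
    return 1.0 if keywords[dimension] in transcript else 0.0

def score_transcript(transcript: str) -> float:
    """Combine per-dimension judge scores into one weighted score.
    A quality gate can then compare this value against a threshold."""
    return sum(w * judge_dimension(transcript, d) for d, w in RUBRIC.items())

transcript = "Agent gave correct facts via a single tool call; task done."
overall = score_transcript(transcript)
print(round(overall, 2))
```

In a continuous evaluation pipeline, `score_transcript` would run over every new agent transcript, with the weighted score gating deployment when it falls below an agreed threshold.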