Overview
The Agent Evaluation skill provides a robust framework for assessing the effectiveness of AI agents and Claude Code commands. It addresses the inherent challenges of non-deterministic AI behavior by shifting the focus from rigid execution paths to outcome-based metrics and process quality. By applying multi-dimensional rubrics that cover instruction following, completeness, reasoning quality, and tool efficiency, developers can systematically identify performance bottlenecks. The skill also incorporates LLM-as-judge techniques with specific strategies to mitigate common judging biases, such as position bias and length bias, ensuring that context engineering choices are validated against realistic task complexity levels and token constraints.
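To make these mechanics concrete, below is a minimal Python sketch of the two core ideas: per-dimension rubric scoring and pairwise LLM-as-judge comparison that mitigates position bias by judging both orderings. The `call_judge` helper, the 1-5 scale, and the agreement rule are illustrative assumptions, not the skill's actual API.

```python
# Minimal sketch of rubric scoring and position-bias mitigation.
# `call_judge` is a hypothetical stand-in for the project's LLM client.
from dataclasses import dataclass, field
from statistics import mean

# Rubric dimensions named in the overview above.
RUBRIC_DIMENSIONS = [
    "instruction_following",
    "completeness",
    "reasoning_quality",
    "tool_efficiency",
]


@dataclass
class RubricScore:
    """Per-dimension scores on an assumed 1-5 scale."""
    scores: dict[str, float] = field(default_factory=dict)

    @property
    def overall(self) -> float:
        # Simple unweighted mean; assumes at least one dimension was scored.
        return mean(self.scores.values())


def call_judge(prompt: str) -> str:
    """Hypothetical LLM call; replace with your client of choice."""
    raise NotImplementedError


def score_with_rubric(task: str, output: str) -> RubricScore:
    """Ask the judge to rate the output on each rubric dimension."""
    scores = {}
    for dim in RUBRIC_DIMENSIONS:
        reply = call_judge(
            f"Task: {task}\n\nAgent output:\n{output}\n\n"
            f"Rate the output's {dim.replace('_', ' ')} from 1 (poor) "
            "to 5 (excellent). Answer with a single integer."
        )
        scores[dim] = float(reply.strip())
    return RubricScore(scores=scores)


def judge_pairwise(task: str, output_a: str, output_b: str) -> str:
    """Compare two outputs, judging both orderings so a preference for
    whichever response appears first (position bias) cancels out."""
    template = (
        "Task: {task}\n\nResponse 1:\n{first}\n\nResponse 2:\n{second}\n\n"
        "Which response is better? Answer '1' or '2'."
    )
    verdict_ab = call_judge(template.format(task=task, first=output_a, second=output_b))
    verdict_ba = call_judge(template.format(task=task, first=output_b, second=output_a))
    # A wins only if it is preferred in both orderings; likewise for B.
    if verdict_ab.strip() == "1" and verdict_ba.strip() == "2":
        return "A"
    if verdict_ab.strip() == "2" and verdict_ba.strip() == "1":
        return "B"
    # Disagreement across orderings suggests position bias; report a tie.
    return "tie"
```

Length bias can be handled in the same spirit, for example by instructing the judge to ignore verbosity or by normalizing scores against response length before aggregation.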