Overview
This skill provides comprehensive methodologies for assessing autonomous agent systems, moving beyond traditional software testing to address non-determinism and complex decision-making. It enables developers to implement LLM-as-judge frameworks, design outcome-focused rubrics spanning factual and process-oriented dimensions, and validate context engineering choices. By establishing systematic quality gates and performance benchmarks, it keeps agent pipelines reliable, catches regressions before deployment, and balances token usage against model performance.
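To make the LLM-as-judge and quality-gate ideas concrete, here is a minimal sketch. The rubric dimensions, the `call_llm` callable, and the `passes_quality_gate` threshold are all illustrative assumptions, not part of this skill's API; any text-in/text-out completion function can be plugged in.

```python
import json

# Hypothetical rubric: one factual and one process-oriented dimension,
# each scored 1 (poor) to 5 (excellent) by the judge model.
RUBRIC = {
    "factual_accuracy": "Are all claims in the answer supported by the provided context?",
    "process_quality": "Did the agent take reasonable intermediate steps (plans, tool calls)?",
}

JUDGE_PROMPT = """You are an impartial evaluator. Score the agent transcript
on each rubric dimension from 1 (poor) to 5 (excellent).
Return JSON of the form {{"scores": {{"<dimension>": <int>, ...}}, "rationale": "<string>"}}.

Rubric:
{rubric}

Transcript:
{transcript}
"""

def judge(transcript: str, call_llm) -> dict:
    """Score one agent transcript with an LLM judge.

    `call_llm` is any caller-supplied function that takes a prompt string
    and returns the model's text completion (assumed to be valid JSON here;
    production code would validate and retry on malformed output).
    """
    rubric_text = "\n".join(f"- {name}: {question}" for name, question in RUBRIC.items())
    raw = call_llm(JUDGE_PROMPT.format(rubric=rubric_text, transcript=transcript))
    return json.loads(raw)

def passes_quality_gate(result: dict, threshold: int = 4) -> bool:
    """Quality gate: every rubric dimension must meet the threshold."""
    return all(score >= threshold for score in result["scores"].values())
```

Run against a fixed suite of transcripts before each deployment, a gate like this turns the rubric into a regression check: a change that lowers any dimension below the threshold fails the build rather than reaching production.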