Overview
This skill provides specialized guidance for testing and validating AI agent systems, moving beyond traditional software testing to address non-determinism and complex reasoning paths. It enables developers to implement LLM-as-judge methodologies, design outcome-focused rubrics, and validate context engineering strategies through systematic performance measurement. By focusing on multi-dimensional quality gates—including factual accuracy, tool efficiency, and token usage—it helps teams catch regressions, optimize token budgets, and ensure agents reliably achieve their goals in production environments.
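To make the LLM-as-judge approach concrete, here is a minimal sketch of a weighted rubric evaluation with a pass/fail quality gate. The `Criterion` dataclass, the `call_model` callable, the prompt template, and the example rubric are illustrative assumptions, not part of the skill's actual API; `call_model` stands in for whatever function sends a prompt to your judge model and returns its text reply.

```python
import json
from dataclasses import dataclass
from typing import Callable

# One rubric criterion: what to check and how much it counts toward the gate.
@dataclass
class Criterion:
    name: str         # e.g. "factual_accuracy"
    description: str  # instruction the judge model scores against
    weight: float     # relative contribution to the overall score

# Hypothetical judge prompt; asks for strict JSON so the score is parseable.
JUDGE_PROMPT = """You are grading an AI agent's output against one criterion.
Criterion: {name} - {description}
Task given to the agent:
{task}
Agent output:
{output}
Respond with JSON only: {{"score": <integer 0-5>, "rationale": "<one sentence>"}}"""

def judge(
    task: str,
    output: str,
    rubric: list[Criterion],
    call_model: Callable[[str], str],  # hypothetical: prompt in, judge reply out
    pass_threshold: float = 0.8,
) -> dict:
    """Score one agent transcript against a rubric and apply a quality gate."""
    scores = {}
    for c in rubric:
        reply = call_model(JUDGE_PROMPT.format(
            name=c.name, description=c.description, task=task, output=output))
        scores[c.name] = json.loads(reply)["score"] / 5.0  # normalize to 0-1
    total_weight = sum(c.weight for c in rubric)
    overall = sum(scores[c.name] * c.weight for c in rubric) / total_weight
    return {"scores": scores, "overall": overall, "passed": overall >= pass_threshold}

# Example rubric covering the quality dimensions named above.
RUBRIC = [
    Criterion("factual_accuracy", "Claims are correct and grounded in the task context.", 0.5),
    Criterion("tool_efficiency", "The agent used the fewest tool calls needed.", 0.3),
    Criterion("token_usage", "The response is concise, with no redundant output.", 0.2),
]
```

Judging each criterion in a separate call, rather than asking for one holistic grade, tends to reduce judge variance and makes regressions attributable to a specific dimension; the weights and threshold here are placeholders to tune per task.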