Introduction
This skill provides a framework for evaluating complex agentic systems, where traditional software testing falls short. It addresses non-determinism and context-dependent failures through outcome-focused evaluation, multi-dimensional scoring rubrics, and systematic test set design. Whether you are validating context engineering choices or building quality gates for production pipelines, it offers guidance for measuring improvements, catching regressions, and maintaining agent reliability across complexity levels, from simple lookups to deep research synthesis.
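
As a rough illustration of what a multi-dimensional scoring rubric feeding a quality gate might look like, here is a minimal Python sketch. The dimension names, weights, and the 0.7 threshold are hypothetical values chosen for the example and are not prescribed by this skill. It assumes each dimension has already been scored in the range 0.0 to 1.0 by a human or automated judge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One axis of the rubric, with its relative importance."""
    name: str
    weight: float


# Hypothetical dimensions and weights, for illustration only.
RUBRIC = (
    Dimension("factual_accuracy", 0.5),
    Dimension("completeness", 0.3),
    Dimension("tool_efficiency", 0.2),
)


def weighted_score(scores: dict[str, float], rubric=RUBRIC) -> float:
    """Combine per-dimension scores (each in [0, 1]) into one weighted score."""
    for dim in rubric:
        value = scores[dim.name]  # raises KeyError if a dimension is missing
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{dim.name} score {value} is outside [0, 1]")
    total_weight = sum(dim.weight for dim in rubric)
    return sum(dim.weight * scores[dim.name] for dim in rubric) / total_weight


def passes_gate(scores: dict[str, float], threshold: float = 0.7) -> bool:
    """Quality gate: pass only if the weighted score meets the threshold."""
    return weighted_score(scores) >= threshold


if __name__ == "__main__":
    run = {"factual_accuracy": 1.0, "completeness": 0.5, "tool_efficiency": 0.0}
    print(weighted_score(run), passes_gate(run))
```

Scoring the final outcome rather than the exact sequence of steps keeps the check tolerant of non-deterministic agent behavior: two runs that take different paths can still be compared on the same dimensions.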