Introduction
This skill provides a framework for testing and validating LLM agents, whose outputs are non-deterministic and can vary between runs of the same prompt. Instead of relying on exact string matching, it uses behavioral contract testing, statistical evaluation patterns, and multi-dimensional reliability metrics. The aim is to narrow the gap between benchmark results and real-world performance, so that issues such as metric gaming, data leakage, and regressions are caught before they reach production.
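To make the contrast with string matching concrete, here is a minimal sketch of a behavioral contract check evaluated statistically over repeated runs. It is an illustration under stated assumptions, not this skill's actual API: `pass_rate`, `make_fake_agent`, and `mentions_paris` are hypothetical names, and the fake agent stands in for a real LLM call.

```python
import random
from typing import Callable


def pass_rate(agent: Callable[[str], str],
              prompt: str,
              contract: Callable[[str], bool],
              trials: int = 20) -> float:
    """Run the agent `trials` times; return the fraction of outputs satisfying the contract."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passed = sum(1 for _ in range(trials) if contract(agent(prompt)))
    return passed / trials


def make_fake_agent(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a non-deterministic LLM agent (assumption, not a real API)."""
    rng = random.Random(seed)
    variants = [
        "The capital of France is Paris.",
        "Paris.",
        "It's Paris, of course!",
    ]

    def agent(prompt: str) -> str:
        # Ignores the prompt and returns one of several equivalent phrasings.
        return rng.choice(variants)

    return agent


def mentions_paris(output: str) -> bool:
    """Behavioral contract: the answer names Paris and stays short."""
    return "paris" in output.lower() and len(output) < 100


agent = make_fake_agent(seed=0)
rate = pass_rate(agent, "What is the capital of France?", mentions_paris, trials=20)

# Gate on a threshold rather than on one exact string.
assert rate >= 0.9
```

An exact-match assertion against any single phrasing would fail intermittently here, while the contract holds for every variant. The 0.9 threshold and 20 trials are arbitrary choices for the sketch; in practice they would depend on the cost of a run and the tolerance for flaky behavior.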