Introduction
This skill provides a comprehensive framework for testing and validating Large Language Model (LLM) outputs within the Claude Code environment. It is built around the '95% Variance Finding', the research-backed insight that prompt quality and sampling settings account for nearly all variation in output quality, and uses it to focus developer effort where it matters most. By providing templates for multi-dimensional rubrics, automated LLM-as-judge prompts, and strategies for handling non-determinism, the skill enables developers to build rigorous quality gates, A/B test prompts, and detect regressions in AI-powered production systems.
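To illustrate how these pieces fit together, the sketch below combines a multi-dimensional rubric with an LLM-as-judge prompt to form a simple quality gate. It is a minimal sketch, not the skill's actual API: the `call_model` stub, the rubric dimensions, and the pass threshold are all hypothetical placeholders, and the stub returns a canned response so the example runs without credentials.

```python
import json

# Hypothetical rubric: each dimension is scored 1-5 by a judge model.
RUBRIC = {
    "accuracy": "Are all factual claims in the answer correct?",
    "completeness": "Does the answer address every part of the question?",
    "tone": "Is the answer professional and appropriately hedged?",
}

JUDGE_PROMPT = """You are a strict evaluator. Score the ANSWER below
against each rubric dimension on a 1-5 scale. Respond with JSON only,
e.g. {{"accuracy": 4, "completeness": 5, "tone": 3}}.

RUBRIC:
{rubric}

QUESTION:
{question}

ANSWER:
{answer}"""


def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. via an SDK or CLI).

    Swap in an actual client here; this stub returns a fixed response
    so the sketch runs without network access or API keys.
    """
    return '{"accuracy": 4, "completeness": 5, "tone": 4}'


def judge(question: str, answer: str, threshold: float = 3.5) -> bool:
    """Score one answer against the rubric; pass if the mean clears the threshold."""
    prompt = JUDGE_PROMPT.format(
        rubric=json.dumps(RUBRIC, indent=2), question=question, answer=answer
    )
    scores = json.loads(call_model(prompt))
    mean = sum(scores.values()) / len(scores)
    return mean >= threshold


if __name__ == "__main__":
    passed = judge("What is 2 + 2?", "2 + 2 equals 4.")
    print("quality gate passed" if passed else "quality gate failed")
```

Because a single judge call is itself non-deterministic, one common mitigation is to invoke `judge` several times per answer and aggregate the scores (for example, taking the median) before applying the threshold.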