Introduction
This skill provides a comprehensive framework for building production-grade evaluation pipelines using the LLM-as-a-Judge paradigm. It equips Claude to perform direct scoring and pairwise comparisons while actively mitigating systematic biases such as position bias and verbosity (length) bias. By pairing structured rubrics with chain-of-thought justification, it turns subjective assessments into reliable, actionable metrics for model benchmarking, A/B testing, and quality assurance in AI-driven applications, making it well suited to developers who need to establish consistent quality standards across AI systems.
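As a concrete illustration of the position-bias mitigation described above, the sketch below runs a pairwise comparison in both presentation orders and only accepts a winner that holds under both. This is a minimal sketch, not the skill's actual implementation: `judge_fn`, `pairwise_judge`, and the prompt template are hypothetical names standing in for whatever LLM call and prompts a given pipeline uses.

```python
from typing import Callable

# Illustrative prompt template; a real rubric and instructions would be richer.
PAIRWISE_PROMPT = """You are an impartial judge. Compare the two responses to the
user question and reply with exactly "A" or "B".

Question: {question}

Response A:
{a}

Response B:
{b}
"""

def pairwise_judge(
    question: str,
    response_1: str,
    response_2: str,
    judge_fn: Callable[[str], str],  # hypothetical wrapper around an LLM call
) -> str:
    """Compare two responses in both orders to cancel position bias.

    Returns "1", "2", or "tie" when the verdicts disagree across orders.
    """
    # First pass: response_1 is shown in slot A, response_2 in slot B.
    fwd = judge_fn(PAIRWISE_PROMPT.format(
        question=question, a=response_1, b=response_2)).strip().upper()
    # Second pass: slots swapped, so a position-biased judge contradicts itself.
    rev = judge_fn(PAIRWISE_PROMPT.format(
        question=question, a=response_2, b=response_1)).strip().upper()

    if fwd not in {"A", "B"} or rev not in {"A", "B"}:
        return "tie"  # unparseable verdict; treat as inconclusive
    winner_fwd = "1" if fwd == "A" else "2"
    winner_rev = "1" if rev == "B" else "2"  # slots were swapped on this pass
    # Accept a winner only if it survives both orderings.
    return winner_fwd if winner_fwd == winner_rev else "tie"
```

Requiring agreement across both orderings trades a higher tie rate for verdicts that cannot be attributed to position alone; a direct-scoring variant would instead prompt for a chain-of-thought justification followed by a rubric score.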