About
This skill provides a comprehensive framework for building reliable LLM evaluation systems using automated judge models. It bridges the gap between raw model outputs and production-ready quality assessment by applying research-backed techniques such as pairwise comparison, direct scoring against rubrics, and systematic bias mitigation. Whether you are debugging inconsistent results or setting up A/B testing pipelines, this skill equips you with the metrics, prompt structures, and decision frameworks needed to produce high-fidelity automated assessments that correlate closely with human judgment.
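To make the named techniques concrete, below is a minimal sketch of a pairwise-comparison judge with position-swap debiasing (one common bias-mitigation step). Everything in it is illustrative rather than taken from the skill itself: `call_judge` is a placeholder for whatever LLM client you use, and the prompt wording and verdict tokens ("A", "B", "TIE") are assumptions, not the skill's canonical templates.

```python
"""Sketch: pairwise LLM-judge comparison with position-swap debiasing.

Assumed names (not defined by the skill): JUDGE_PROMPT, call_judge,
pairwise_verdict, and the single-token verdict format.
"""

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
user question and decide which answers it better.

Question: {question}

Response A:
{response_a}

Response B:
{response_b}

Reply with exactly one token: "A", "B", or "TIE"."""


def call_judge(prompt: str) -> str:
    """Placeholder: route the prompt to your judge model and return its raw text."""
    raise NotImplementedError("Wire this up to your LLM client of choice.")


def pairwise_verdict(question: str, answer_1: str, answer_2: str) -> str:
    """Judge twice with the answers in both orders to mitigate position bias."""
    first = call_judge(JUDGE_PROMPT.format(
        question=question, response_a=answer_1, response_b=answer_2)).strip().upper()
    second = call_judge(JUDGE_PROMPT.format(
        question=question, response_a=answer_2, response_b=answer_1)).strip().upper()

    # Map the swapped round's verdict back to the original labeling.
    second_mapped = {"A": "B", "B": "A"}.get(second, "TIE")

    # Accept a winner only when both orderings agree; otherwise report a tie.
    return first if first == second_mapped else "TIE"
```

The same structure extends to direct scoring: replace the comparison prompt with a rubric prompt that returns a numeric score, and aggregate repeated calls instead of reconciling two orderings.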