About
This skill provides a comprehensive framework for building evaluation systems for Large Language Models. It synthesizes academic research and industry best practices to help developers implement automated quality assessments, including direct scoring and pairwise comparison protocols. By addressing systemic judge failures such as position bias and verbosity (length) bias, the skill enables consistent, human-aligned evaluation pipelines. It is particularly useful for establishing quality standards in production, designing A/B tests for prompts, and generating domain-specific rubrics for automated scoring.
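As a concrete illustration of one such mitigation, the sketch below shows how a pairwise comparison protocol can control for position bias: the judge is invoked twice with the candidate order swapped, and a win is recorded only when both orderings agree. This is a minimal sketch, not the skill's actual implementation; the `judge_fn` callable, the `Verdict` labels, and the `pairwise_judgment` helper are hypothetical names introduced for illustration.

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]


def pairwise_judgment(
    judge_fn: Callable[[str, str, str], Verdict],
    prompt: str,
    response_a: str,
    response_b: str,
) -> Verdict:
    """Judge a pair twice with the candidate order swapped.

    A judge that favors whichever response appears first will
    disagree with itself once the order is reversed; such
    inconsistent pairs are recorded as ties, which is one common
    control for position bias.
    """
    first = judge_fn(prompt, response_a, response_b)
    # Swap the presentation order, then remap the verdict back
    # to the original A/B labels.
    swapped = judge_fn(prompt, response_b, response_a)
    second = {"A": "B", "B": "A", "tie": "tie"}[swapped]
    return first if first == second else "tie"


if __name__ == "__main__":
    # Toy judge for demonstration only: it prefers the longer
    # response, i.e. it exhibits exactly the verbosity bias a
    # real rubric would penalize.
    def toy_judge(prompt: str, first: str, second: str) -> Verdict:
        if len(first) == len(second):
            return "tie"
        return "A" if len(first) > len(second) else "B"

    print(pairwise_judgment(toy_judge, "Summarize X.", "short", "a much longer answer"))
```

In practice the same order-swapping idea extends to batched evaluation: aggregating win rates only over order-consistent verdicts gives a cleaner signal for A/B tests between prompts.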