About
This skill provides specialized knowledge to build and execute sophisticated evaluation systems for Large Language Models. It synthesizes industry best practices and academic research to implement direct scoring and pairwise comparison methodologies while actively mitigating systematic biases like position and length bias. Whether you are building automated evaluation pipelines, designing complex rubrics, or conducting A/B tests for prompt engineering, this skill ensures your AI assessments are consistent, reliable, and closely aligned with human judgment.
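One of the bias mitigations mentioned above, countering position bias in pairwise comparison, can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: `judge` here is a hypothetical stand-in for an LLM judge call that returns `"A"` or `"B"` for whichever response it prefers, and the idea is simply to run the judge in both presentation orders and accept a winner only when the verdicts agree.

```python
def debiased_compare(judge, response_1, response_2):
    """Run the judge in both orders; declare a winner only on agreement."""
    first = judge(response_1, response_2)   # response_1 shown in position "A"
    second = judge(response_2, response_1)  # response_2 shown in position "A"
    # The same underlying response must win in both orders to count.
    if first == "A" and second == "B":
        return "response_1"
    if first == "B" and second == "A":
        return "response_2"
    return "tie"  # order-dependent verdicts are treated as a tie

# Toy judge with pure position bias: always prefers whichever is shown first.
biased_judge = lambda a, b: "A"
print(debiased_compare(biased_judge, "x", "y"))  # → "tie"

# Toy judge with a genuine (order-independent) preference for longer answers.
length_judge = lambda a, b: "A" if len(a) > len(b) else "B"
print(debiased_compare(length_judge, "longer answer", "short"))  # → "response_1"
```

The same swap-and-agree pattern extends to length bias (e.g., normalizing or penalizing verbosity before judging), which this skill also covers.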