About
The Advanced Evaluation skill empowers developers to build reliable, automated assessment pipelines for LLM-powered applications. By synthesizing industry best practices and academic research, it enables Claude to perform sophisticated "LLM-as-a-judge" tasks, including direct scoring and pairwise comparisons. The skill emphasizes scientific rigor, providing specific protocols to mitigate common pitfalls such as position bias, length bias, and verbosity bias. Whether you are benchmarking new prompts, debugging agent consistency, or establishing quality standards for production systems, this skill provides the frameworks and metrics needed for high-confidence evaluation.
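To make the pairwise-comparison and position-bias ideas concrete, here is a minimal sketch of one standard mitigation: run the judge twice with the candidate order swapped and only accept a verdict that survives the swap. The `judge` callable and the `length_biased_judge` stand-in are hypothetical placeholders for illustration, not part of the skill's API.

```python
from typing import Callable

def debiased_pairwise_judge(judge: Callable[[str, str], str],
                            response_a: str, response_b: str) -> str:
    """Run the judge twice with candidates swapped to cancel position bias.

    `judge(first, second)` returns "first", "second", or "tie" for whichever
    presented response it prefers. A verdict counts only if it is consistent
    across both orderings; otherwise the pair is scored as a tie.
    """
    v1 = judge(response_a, response_b)   # pass 1: A shown first
    v2 = judge(response_b, response_a)   # pass 2: positions swapped

    if v1 == "first" and v2 == "second":
        return "A"   # A preferred in both orderings
    if v1 == "second" and v2 == "first":
        return "B"   # B preferred in both orderings
    return "tie"     # inconsistent or tied: no reliable preference

# Hypothetical stand-in judge for demonstration: it prefers the longer
# response (length bias) but has no position bias.
def length_biased_judge(first: str, second: str) -> str:
    if len(first) > len(second):
        return "first"
    if len(second) > len(first):
        return "second"
    return "tie"

print(debiased_pairwise_judge(length_biased_judge, "short", "a much longer answer"))  # B
```

Note that a purely position-biased judge (one that always prefers whichever response is shown first) would contradict itself after the swap and be scored as a tie, which is exactly the behavior this protocol is meant to surface.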