Overview
This skill provides a framework for assessing the quality and performance of LLM applications at multiple layers, from surface-level string overlap to semantic similarity and model-based judgment. It covers a wide spectrum of evaluation techniques, including traditional linguistic metrics like BLEU and ROUGE, semantic evaluation using BERTScore, and modern LLM-as-judge patterns. Whether you are detecting performance regressions in CI/CD, comparing model variants through statistical A/B testing, or establishing human-in-the-loop annotation workflows, this skill helps ensure AI outputs remain accurate, safe, and helpful throughout the development lifecycle.
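As a small taste of the lexical-overlap end of that spectrum, the sketch below scores a single candidate against a reference with sentence-level BLEU and ROUGE-1/ROUGE-L. It assumes the `nltk` and `rouge-score` packages, which are common choices for these metrics but not required by this skill; semantic and LLM-as-judge evaluation are covered in later sections.

```python
# Minimal sketch: lexical-overlap metrics for one reference/candidate pair.
# Assumes: pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The quick brown fox jumps over the lazy dog."
candidate = "A quick brown fox jumped over a lazy dog."

# BLEU operates on token lists; smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE compares raw strings and reports precision/recall/F1 per variant.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU:    {bleu:.3f}")
print(f"ROUGE-1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L: {rouge['rougeL'].fmeasure:.3f}")
```

Scores like these are cheap and deterministic, which makes them useful as a first regression signal in CI/CD, but they only capture surface overlap; semantically equivalent rewordings may score poorly, which is why the later layers exist.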