Introduction
The LLM Evaluation skill provides a robust toolkit for systematically assessing the performance of Large Language Model applications. It bridges the gap between development and production by offering standardized methods for automated metric calculation (such as BERTScore, ROUGE, and BLEU), LLM-as-Judge patterns, and human annotation workflows. Whether you are validating prompt iterations, comparing model providers, or guarding against regressions, this skill supplies the statistical and procedural rigor needed to build reliable AI systems.
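
To make the automated-metric workflow concrete, here is a minimal sketch of scoring model outputs against references. It assumes the Hugging Face `evaluate` library (with `rouge_score` installed for ROUGE); the sample predictions and references are illustrative and do not represent this skill's own API.

```python
# Illustrative only: computing ROUGE and BLEU with the Hugging Face `evaluate`
# library (pip install evaluate rouge_score). In practice, predictions and
# references would come from your evaluation dataset.
import evaluate

predictions = [
    "The cat sat on the mat.",
    "Paris is the capital of France.",
]
references = [
    "A cat was sitting on the mat.",
    "The capital of France is Paris.",
]

# ROUGE: n-gram and longest-common-subsequence overlap, common for summarization.
rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=predictions, references=references)

# BLEU: modified n-gram precision with a brevity penalty, common for translation.
bleu = evaluate.load("bleu")
bleu_scores = bleu.compute(
    predictions=predictions,
    references=[[r] for r in references],  # BLEU supports multiple references per prediction
)

print(rouge_scores)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
print(bleu_scores)   # e.g. {'bleu': ..., 'precisions': [...], 'brevity_penalty': ..., ...}
```

BERTScore follows the same `load`/`compute` pattern but requires a pretrained model download, so it is typically reserved for cases where semantic similarity matters more than surface overlap.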