Overview
LLM Evaluation provides a robust framework for systematically measuring the performance of language model applications through automated metrics, human-in-the-loop workflows, and LLM-as-judge patterns. It enables developers to move beyond subjective testing by providing implementation guidance for classic NLP metrics like BLEU and ROUGE alongside modern embedding-based scores and RAG evaluations. With built-in patterns for A/B testing, regression detection, and benchmarking, this skill helps verify that model updates and prompt changes produce measurable improvements rather than unexpected regressions.
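
To make the automated-metric and regression-detection ideas concrete, here is a minimal sketch in Python. It uses a simple token-level F1 overlap as a lightweight stand-in for BLEU/ROUGE and compares a baseline run against a candidate run. The dataset, metric choice, and tolerance threshold are illustrative assumptions, not part of the skill's actual API.

```python
# Minimal sketch of an automated evaluation plus regression check.
# The data, metric, and TOLERANCE value below are illustrative assumptions.
from statistics import mean


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 overlap: a lightweight stand-in for BLEU/ROUGE."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    common = set(pred_tokens) & set(ref_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(outputs: list[str], references: list[str]) -> float:
    """Average the metric over the whole evaluation set."""
    return mean(token_f1(o, r) for o, r in zip(outputs, references))


# Hypothetical baseline vs. candidate outputs for the same references.
references = ["The capital of France is Paris.", "Water boils at 100 degrees Celsius."]
baseline_outputs = ["Paris is the capital of France.", "Water boils at 100 C."]
candidate_outputs = ["The capital is Paris.", "It boils at 100 degrees Celsius."]

baseline_score = evaluate(baseline_outputs, references)
candidate_score = evaluate(candidate_outputs, references)

# Flag a regression if the candidate drops by more than an illustrative tolerance.
TOLERANCE = 0.02
if candidate_score < baseline_score - TOLERANCE:
    print(f"Regression detected: {baseline_score:.3f} -> {candidate_score:.3f}")
else:
    print(f"No regression: {baseline_score:.3f} -> {candidate_score:.3f}")
```

In practice the same loop structure applies whether the scorer is a classic metric, an embedding-based similarity, or an LLM-as-judge call; only the scoring function changes, while the baseline-versus-candidate comparison and the regression threshold stay in place.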