Introduction
This skill provides a robust toolkit for systematically assessing the quality and performance of Large Language Model applications throughout the development lifecycle. It covers the entire evaluation spectrum, from standard NLP metrics like BLEU and BERTScore, through "LLM-as-judge" patterns for semantic assessment, to human annotation workflows. Designed for developers building production-grade AI, it includes statistical frameworks for A/B testing, regression detection to catch performance drops before release, and metrics specific to Retrieval-Augmented Generation (RAG) systems, enabling data-driven decisions instead of anecdotal spot-checks of model responses.
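To make the two ends of that spectrum concrete, below is a minimal sketch of combining a reference-based metric with an LLM-as-judge score. It assumes the `nltk` package is available for BLEU; the judge is passed in as a generic callable because the actual model client this skill wires up is not specified here, so `judge` and the example strings are illustrative assumptions, not the skill's own API.

```python
"""Minimal sketch: reference-based BLEU plus an LLM-as-judge hook."""
from typing import Callable

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def bleu_score(candidate: str, reference: str) -> float:
    """Sentence-level BLEU with smoothing so short outputs don't score zero."""
    smooth = SmoothingFunction().method1
    return sentence_bleu(
        [reference.split()],          # list of tokenized references
        candidate.split(),            # tokenized candidate
        smoothing_function=smooth,
    )


def judge_score(candidate: str, reference: str,
                judge: Callable[[str], str]) -> float:
    """LLM-as-judge pattern: ask a model to grade semantic fidelity 1-5.

    `judge` is any function that takes a prompt string and returns the
    model's raw text reply (hypothetical stand-in for a real LLM client).
    """
    prompt = (
        "Rate how faithfully the CANDIDATE answer conveys the REFERENCE "
        "answer on a scale of 1 to 5. Reply with the number only.\n"
        f"REFERENCE: {reference}\nCANDIDATE: {candidate}"
    )
    raw = judge(prompt)
    return float(raw.strip()) / 5.0   # normalize to the 0-1 range


if __name__ == "__main__":
    cand = "The capital of France is Paris."
    ref = "Paris is the capital of France."
    print(f"BLEU: {bleu_score(cand, ref):.3f}")
    # A trivial judge stub, used only so the example runs end to end:
    print(f"Judge: {judge_score(cand, ref, judge=lambda _prompt: '5'):.2f}")
```

In practice the judge callable would wrap whatever model endpoint the application already uses, and both scores would be aggregated over an evaluation set rather than a single pair, which is where the A/B testing and regression-detection pieces of the skill come in.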