Overview
The LLM Evaluation skill provides a rigorous, standardized approach to measuring the performance and quality of Large Language Model applications. It equips developers with the tools needed to move beyond anecdotal testing by implementing automated NLP metrics such as BLEU and BERTScore, RAG-specific retrieval analysis, and LLM-as-judge patterns for qualitative assessment. Whether you are comparing model versions, validating prompt engineering changes, or setting up production regression tests, this skill grounds your AI features in data-driven benchmarks and statistically meaningful comparisons.
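As a minimal illustration of the automated-metrics workflow described above, the sketch below scores placeholder model outputs against reference texts using the sacrebleu and bert-score packages. These libraries and the sample sentences are assumptions for illustration, not prescribed dependencies of the skill.

```python
# Minimal sketch of automated metric scoring, assuming the `sacrebleu` and
# `bert-score` packages are installed (pip install sacrebleu bert-score).
# The candidate and reference sentences are placeholder data.
import sacrebleu
from bert_score import score as bert_score

candidates = [
    "The cat sat on the mat.",
    "Paris is the capital of France.",
]
references = [
    "A cat was sitting on the mat.",
    "The capital of France is Paris.",
]

# Corpus-level BLEU: n-gram overlap between candidates and references.
bleu = sacrebleu.corpus_bleu(candidates, [references])
print(f"BLEU: {bleu.score:.2f}")

# BERTScore: embedding-based semantic similarity; returns per-sentence
# precision, recall, and F1 tensors.
precision, recall, f1 = bert_score(candidates, references, lang="en")
print(f"BERTScore F1 (mean): {f1.mean().item():.4f}")
```

In practice, corpus-level scores like these are paired with a significance test (for example, paired bootstrap resampling) before concluding that one model version outperforms another.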