Overview
The LLM Evaluation skill provides a robust toolkit for measuring and improving the quality of AI-driven applications. It enables developers to move beyond vibes-based testing by implementing standardized evaluation strategies, ranging from traditional NLP metrics like BLEU and ROUGE to modern LLM-as-judge patterns and statistical A/B testing. Whether you are optimizing RAG retrieval, comparing model performance, or establishing safety baselines, this skill provides the necessary patterns to ensure your LLM outputs are accurate, grounded, and production-ready.
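As a rough illustration of the reference-based metrics mentioned above, the sketch below scores a single generated answer against a reference with BLEU and ROUGE. It is a minimal example assuming the third-party `sacrebleu` and `rouge-score` packages, not an API provided by this skill; the reference and candidate strings are made up for demonstration.

```python
# Minimal sketch: scoring one model output against a reference answer.
# Assumes `pip install sacrebleu rouge-score`; illustrative only and
# independent of the skill's own interfaces.
import sacrebleu
from rouge_score import rouge_scorer

reference = "The Eiffel Tower is located in Paris, France."
candidate = "The Eiffel Tower stands in Paris."

# Corpus-level BLEU over a single sentence pair (sacrebleu expects a
# list of hypotheses and a list of reference lists).
bleu = sacrebleu.corpus_bleu([candidate], [[reference]])
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-1 and ROUGE-L F-measures via the rouge-score package.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(f"ROUGE-1 F1: {scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")
```

In practice these lexical-overlap metrics are a baseline: they are cheap and deterministic, but they miss paraphrases and factual errors, which is where the LLM-as-judge and A/B testing patterns described in this skill come in.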