Introduction
The llm-evaluation skill empowers developers to move beyond subjective 'vibe checks' and establish rigorous, data-driven testing for AI applications. It provides standardized implementations for automated NLP metrics (BLEU, ROUGE, BERTScore), LLM-as-judge patterns, and human annotation workflows. Whether you are validating RAG retrieval accuracy, comparing model performance via A/B testing, or monitoring for regressions during deployment, this skill offers the necessary tools to measure accuracy, coherence, and safety systematically across the entire development lifecycle.
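As a point of reference for what the automated-metric layer looks like in practice, the sketch below scores a batch of model outputs with BLEU and ROUGE. It assumes the Hugging Face `evaluate` package rather than this skill's own API, so treat it as an illustrative example, not the skill's interface.

```python
# Illustrative sketch: scoring model outputs against references with standard
# automated metrics. Assumes the Hugging Face `evaluate` package
# (`pip install evaluate rouge_score`), not this skill's own API.
import evaluate

predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# ROUGE: n-gram and longest-common-subsequence overlap, common for summarization-style tasks.
rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=predictions, references=references)

# BLEU: n-gram precision with a brevity penalty, common for translation-style tasks.
# BLEU accepts multiple references per prediction, hence the nested list.
bleu = evaluate.load("bleu")
bleu_scores = bleu.compute(predictions=predictions, references=[[r] for r in references])

print(rouge_scores)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
print(bleu_scores)   # e.g. {'bleu': ..., 'precisions': [...], 'brevity_penalty': ...}
```

Metrics like these are cheap and reproducible, which makes them a good first gate before more expensive LLM-as-judge or human annotation passes described later in this document.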