Overview
The llm-evaluation skill provides a robust toolkit for measuring and optimizing the performance of AI-driven applications. It enables developers to implement a wide array of assessment strategies, ranging from traditional NLP metrics like BLEU and ROUGE to embedding-based metrics like BERTScore. Beyond simple metrics, it facilitates more complex evaluation patterns such as LLM-as-judge for qualitative analysis, statistical A/B testing for model comparison, and specialized RAG evaluation to verify retrieval accuracy. This skill is essential for developers looking to move beyond vibes-based development to a data-driven approach that ensures production readiness and prevents regressions.
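As a rough illustration of the reference-based metrics mentioned above (not the skill's own API, which is documented in later sections), the following minimal sketch scores a candidate answer against a reference using the commonly available `sacrebleu` and `rouge_score` packages:

```python
# Illustrative sketch only: demonstrates BLEU and ROUGE scoring with third-party
# packages; the llm-evaluation skill wraps metrics like these behind its own API.
import sacrebleu
from rouge_score import rouge_scorer

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# Corpus-level BLEU: sacrebleu takes a list of hypotheses and a list of
# reference lists.
bleu = sacrebleu.corpus_bleu([candidate], [[reference]])
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-1 and ROUGE-L F-measures for the same pair.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```

In practice these lexical metrics are typically combined with embedding-based scores (e.g. BERTScore) and LLM-as-judge rubrics, since surface overlap alone often underestimates the quality of paraphrased but correct outputs.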