About
The llm-evaluation skill provides a framework for measuring and optimizing the performance of Large Language Model (LLM) applications. It helps developers move beyond vibes-based testing by implementing automated metrics such as BLEU and BERTScore, establishing LLM-as-judge patterns for qualitative assessment, and managing human-in-the-loop feedback. Whether you are validating prompt-engineering changes, measuring RAG retrieval quality, or running statistical A/B tests between model versions, the skill supplies the patterns and implementation logic needed to ensure production-grade reliability and catch performance regressions before they reach users.
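As an illustration of the automated-metric side, here is a minimal regression-check sketch. It assumes the sacrebleu package; the sample data, the BASELINE_BLEU threshold, and the structure of the check are illustrative, not this skill's actual API.

```python
# Minimal sketch of an automated BLEU regression check.
# Assumes sacrebleu is installed (pip install sacrebleu); data and
# threshold below are illustrative placeholders.
import sacrebleu

# Hypothetical eval set: reference answers for a handful of prompts.
references = [
    "The Eiffel Tower is located in Paris, France.",
    "Water boils at 100 degrees Celsius at sea level.",
]
# Outputs produced by the candidate model or prompt version under test.
candidates = [
    "The Eiffel Tower is in Paris, France.",
    "At sea level, water boils at 100 degrees Celsius.",
]

# Corpus-level BLEU: sacrebleu takes the hypotheses plus a list of
# reference streams (one stream here, aligned with the hypotheses).
bleu = sacrebleu.corpus_bleu(candidates, [references])
print(f"BLEU: {bleu.score:.2f}")

# Fail the check if quality drops below a previously recorded baseline.
BASELINE_BLEU = 40.0  # illustrative threshold, not a recommended value
assert bleu.score >= BASELINE_BLEU, "Regression detected: BLEU below baseline"
```

In practice the same gate pattern extends to BERTScore, LLM-as-judge scores, or retrieval metrics: compute the metric over a fixed eval set, compare against a stored baseline, and block the change if the score regresses.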