Introduction
The llm-evaluation skill provides a comprehensive toolkit for measuring and improving the quality of AI applications, moving beyond anecdotal testing to systematic measurement. It lets developers implement standard NLP metrics such as BLEU and ROUGE, leverage embedding-based scores such as BERTScore, and apply 'LLM-as-Judge' patterns for automated qualitative assessment. By integrating regression detection, statistical A/B testing, and benchmarking suites, the skill ensures that model updates, prompt changes, and RAG pipeline adjustments produce measurable improvements rather than silent performance regressions.
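To make the workflow concrete, here is a minimal sketch (not the skill's actual API) combining two of the pieces named above: per-example ROUGE-L scoring via the `rouge-score` package and a paired bootstrap test that estimates whether a candidate prompt or model variant genuinely outperforms the baseline. The evaluation set, variant outputs, and resample count are illustrative assumptions.

```python
import random
from rouge_score import rouge_scorer  # assumes `pip install rouge-score`

# Hypothetical evaluation set: reference answers plus outputs from two variants.
references = [
    "Paris is the capital of France.",
    "Water boils at 100 degrees Celsius at sea level.",
]
baseline_outputs = [
    "The capital of France is Paris.",
    "Water boils at 100 C.",
]
candidate_outputs = [
    "Paris is France's capital city.",
    "At sea level, water boils at 100 degrees Celsius.",
]

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l(refs, preds):
    """Per-example ROUGE-L F1 between references and model outputs."""
    return [scorer.score(r, p)["rougeL"].fmeasure for r, p in zip(refs, preds)]

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Fraction of bootstrap resamples in which variant B beats variant A.

    Values near 1.0 suggest a real improvement; values near 0.5 suggest noise.
    """
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    wins = 0
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]  # resample per-example deltas
        if sum(sample) > 0:
            wins += 1
    return wins / n_resamples

baseline_scores = rouge_l(references, baseline_outputs)
candidate_scores = rouge_l(references, candidate_outputs)
confidence = paired_bootstrap(baseline_scores, candidate_scores)

print(f"Baseline ROUGE-L:  {sum(baseline_scores) / len(baseline_scores):.3f}")
print(f"Candidate ROUGE-L: {sum(candidate_scores) / len(candidate_scores):.3f}")
print(f"P(candidate > baseline) ~= {confidence:.2f}")
```

The same pattern extends to regression detection: run the paired comparison on every change and fail the check when the candidate's win rate drops below an agreed threshold, rather than relying on a single aggregate score.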