About
The LLM Evaluation skill provides a robust framework for measuring and improving the performance of AI applications. It enables developers to implement systematic testing across multiple dimensions, including automated text metrics like BERTScore and ROUGE, LLM-as-judge patterns for qualitative assessment, and rigorous statistical A/B testing. By providing standardized tools for regression detection and RAG-specific retrieval metrics, this skill helps teams build confidence in their production AI systems and validate improvements from prompt engineering or model fine-tuning.
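As a concrete illustration of the automated text metrics mentioned above, here is a minimal sketch of computing ROUGE and BERTScore for a single candidate/reference pair. It assumes the third-party rouge-score and bert-score packages are installed; the package names, example strings, and calling style are illustrative assumptions, not this skill's actual interface.

```python
# Minimal sketch of automated text metrics (ROUGE, BERTScore).
# Assumes `pip install rouge-score bert-score`; the skill's own API may differ.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

candidate = "The cat sat on the mat."          # hypothetical model output
reference = "A cat was sitting on the mat."    # hypothetical ground truth

# Lexical overlap: ROUGE-1 and ROUGE-L F-measures.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

# Semantic similarity: BERTScore F1 (downloads a model on first use).
_, _, f1 = bert_score([candidate], [reference], lang="en")
print("BERTScore F1:", round(f1.item(), 3))
```

In practice, metrics like these are run over a held-out evaluation set and tracked across prompt or model changes, which is where the regression-detection and A/B-testing pieces of the skill come in.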