About
This skill provides a framework for measuring and improving the quality of AI applications. It covers core evaluation methodologies, including traditional NLP metrics such as BLEU and ROUGE, embedding-based similarity scores, and modern LLM-as-judge techniques. By combining automated regression detection, statistical A/B testing, and human annotation workflows, it helps developers build confidence in production systems, validate prompt improvements, and systematically compare model performance across use cases such as RAG systems and text generation.
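
As a small illustration of the kind of lexical-overlap metric mentioned above, the sketch below computes a ROUGE-1-style recall between a reference answer and a model output. It is a minimal, self-contained example for orientation only; the function name and inputs are illustrative and not part of this skill's interface.

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """ROUGE-1-style recall: fraction of reference unigrams that appear in the candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each reference token is credited at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand_counts[token]) for token, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

# Example: score a generated answer against a gold reference.
reference = "The cat sat on the mat"
candidate = "A cat sat on a mat"
print(f"ROUGE-1 recall: {rouge1_recall(reference, candidate):.2f}")  # 0.67
```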