Introduction
This skill provides a complete toolkit for measuring and improving AI application quality through rigorous evaluation pipelines. It integrates automated metrics like BLEU, ROUGE, and BERTScore with sophisticated patterns like LLM-as-Judge and RAG-specific retrieval metrics. Whether you are comparing model providers, validating prompt engineering improvements, or setting up regression testing for production, this skill offers the frameworks needed to build confidence in your AI systems using statistical analysis and systematic benchmarking.
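To make the idea of an automated metric concrete, here is a minimal, illustrative sketch (not the skill's actual implementation) of a tiny regression suite scored with clipped unigram precision, the simplest building block of BLEU. The helper name `unigram_precision` and the toy test cases are assumptions for illustration only.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the fraction of candidate tokens
    that also appear in the reference, with each token's count
    clipped at its count in the reference (as in BLEU-1)."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    matched = sum(
        min(count, ref_counts[tok])
        for tok, count in Counter(cand_tokens).items()
    )
    return matched / len(cand_tokens)

# A toy regression suite: each case pairs a model output with a reference.
cases = [
    ("the cat sat on the mat", "the cat sat on the mat"),
    ("a dog ran in the park", "the dog ran through the park"),
]

scores = [unigram_precision(cand, ref) for cand, ref in cases]
print(scores)  # the first case matches its reference exactly -> 1.0
```

In a real pipeline this score would be tracked across model versions so that a drop below a chosen threshold fails the regression test; production-grade metrics (full BLEU, ROUGE, BERTScore) refine the same compare-against-reference pattern.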