Overview
This skill provides a robust framework for measuring and optimizing the quality of AI-driven applications. It equips developers with a suite of tools ranging from traditional NLP metrics like BLEU and ROUGE to modern 'LLM-as-judge' patterns and statistical A/B testing. Whether you are validating RAG performance, comparing model providers, or setting up automated regression testing in a CI/CD pipeline, this skill offers the implementation patterns needed to move from anecdotal prompt testing to data-driven performance engineering.
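To give a flavor of the metric end of that range, below is a minimal sketch of a ROUGE-1 F1 calculation in plain Python. It assumes simple whitespace tokenization and is purely illustrative of the kind of reference-based scoring the skill wraps; the skill's own tooling may rely on a dedicated metrics library instead.

```python
# Illustrative sketch only: ROUGE-1 F1 (unigram overlap) between a candidate
# and a reference string, assuming lowercase whitespace tokenization.
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Return the unigram-overlap F1 score between candidate and reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Overlap counts each shared unigram, bounded by its frequency on each side.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Example: imperfect but overlapping candidate scores well below 1.0.
    print(rouge1_f1("the cat sat on the mat", "a cat sat on a mat"))
```

The same scoring pattern (candidate vs. reference, aggregated over a test set) underlies the heavier-weight options the skill covers, from BLEU and ROUGE variants to LLM-as-judge rubrics.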