About
This skill provides a structured framework for measuring the quality, accuracy, and reliability of LLM-powered applications. It enables developers to implement automated metrics like BERTScore and ROUGE, establish LLM-as-judge evaluation patterns, and manage human annotation workflows. By integrating systematic benchmarking and A/B testing, the skill helps teams detect performance regressions, compare model variations, and build confidence in production AI systems through rigorous statistical validation and factual grounding checks.
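As a minimal illustration of the automated-metric piece, the sketch below scores a single model output against a reference with ROUGE (via the `rouge_score` package) and BERTScore (via the `bert-score` package), then applies a simple regression gate. The packages, sample strings, and the 0.5 threshold are assumptions for illustration only, not tooling mandated by this skill.

```python
# Minimal sketch (assumed tooling): pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The cat sat on the mat."          # ground-truth answer
prediction = "A cat was sitting on the mat."   # model output under test

# ROUGE: n-gram overlap between the prediction and the reference
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, prediction)
print("ROUGE-1 F1:", round(rouge["rouge1"].fmeasure, 3))
print("ROUGE-L F1:", round(rouge["rougeL"].fmeasure, 3))

# BERTScore: embedding-based semantic similarity (downloads a model on first run)
P, R, F1 = bert_score([prediction], [reference], lang="en")
print("BERTScore F1:", round(F1.item(), 3))

# Example regression gate: fail the check if the score drops below a threshold.
# The 0.5 cutoff is an arbitrary placeholder, not a recommended value.
assert rouge["rougeL"].fmeasure >= 0.5, "ROUGE-L below regression threshold"
```

In practice the same pattern would run over a full evaluation set rather than one pair, with thresholds derived from baseline runs so that benchmarking and A/B comparisons catch genuine regressions instead of tripping on an arbitrary cutoff.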