Introduction
This skill provides a toolkit for measuring and improving the performance of LLM-based applications. It covers automated metrics such as BERTScore and ROUGE, LLM-as-judge patterns for qualitative assessment, and structured human evaluation workflows. Regression detection and statistical A/B testing help verify that prompt changes and model updates produce measurable improvements without degrading quality or reliability in production.
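To make the metric and A/B testing ideas concrete, here is a minimal sketch using only the Python standard library. It is not the skill's own API: the function names `rouge1_f1` and `paired_permutation_p_value` are hypothetical. The ROUGE-1 variant is simplified (whitespace tokenization, no stemming), and the significance test is a paired sign-flip permutation test, chosen here as one reasonable option among several.

```python
import random
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def paired_permutation_p_value(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-example scores.

    scores_a and scores_b are scores for the same examples under variant A
    (e.g. the current prompt) and variant B (e.g. a candidate prompt).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired and of equal length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, the sign of each difference is arbitrary.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

A typical use would be to score each example in a fixed evaluation set under both variants, then treat a small p-value (for instance below 0.05) together with a positive mean difference as evidence of a real improvement, and a small p-value with a negative mean difference as a regression.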