About
The llm-evaluation skill provides a toolkit for systematically assessing the performance, quality, and reliability of AI applications. It enables developers to implement standard automated metrics such as BLEU and BERTScore, set up LLM-as-judge patterns for scoring semantic quality, and run A/B tests with statistical significance checks. By turning raw model output into measurable quality signals, this skill helps teams validate prompt-engineering changes, catch regressions early in the development lifecycle, and build verifiable confidence in complex agentic or RAG-based systems.
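As an illustration of the A/B-testing side, the sketch below compares per-example quality scores from two prompt variants with Welch's t-test via scipy. The scores_a/scores_b lists, the 1-5 judge scale, and the choice of test are assumptions made for this example; they are not the skill's own API, which may structure evaluation runs differently.

```python
# Minimal A/B comparison sketch (illustrative; not the skill's actual API).
# Per-example quality scores for two prompt variants are compared with
# Welch's t-test to check whether the observed difference is significant.
from scipy import stats

# Hypothetical per-example scores, e.g. from an LLM judge on a 1-5 scale.
scores_a = [4.0, 3.5, 4.5, 4.0, 3.0, 4.5, 5.0, 3.5, 4.0, 4.5]
scores_b = [3.0, 3.5, 3.0, 4.0, 2.5, 3.5, 4.0, 3.0, 3.5, 3.0]

# Welch's t-test does not assume equal variances between the two variants.
t_stat, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)

mean_a = sum(scores_a) / len(scores_a)
mean_b = sum(scores_b) / len(scores_b)
print(f"variant A mean={mean_a:.2f}, variant B mean={mean_b:.2f}")
print(f"t={t_stat:.3f}, p={p_value:.4f}")

if p_value < 0.05:
    print("Difference is statistically significant at alpha=0.05")
else:
    print("No significant difference detected; collect more samples")
```

In practice the per-example scores would come from an automated metric (BLEU, BERTScore) or an LLM judge rather than being hard-coded; the significance check is what separates a real improvement from run-to-run noise.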