About
The llm-evaluation skill provides a systematic approach to measuring and improving the quality of LLM-based applications. It enables developers to implement a multi-layered evaluation strategy: automated text metrics (BLEU, ROUGE, BERTScore), LLM-as-Judge patterns for semantic assessment, and structured human evaluation frameworks. By integrating statistical A/B testing and regression detection, the skill helps teams validate prompt changes, compare model performance, and maintain production-grade reliability across text generation, classification, and RAG tasks.
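
To make these layers concrete, here is a minimal sketch of the first and last pillars working together: a reference-based text metric (ROUGE-L) feeding a paired statistical test to flag regressions between two prompt variants. It assumes the `rouge_score` and `scipy` packages are installed; helper names such as `compare_variants` and the sample data are illustrative, not part of the skill's API.

```python
"""Sketch: ROUGE-based regression check between two prompt variants."""
from rouge_score import rouge_scorer
from scipy import stats


def rouge_l_scores(references, predictions):
    """Per-example ROUGE-L F1 between references and model outputs."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return [
        scorer.score(ref, pred)["rougeL"].fmeasure
        for ref, pred in zip(references, predictions)
    ]


def compare_variants(references, outputs_a, outputs_b, alpha=0.05):
    """Paired t-test on per-example ROUGE-L scores; flags a regression when
    variant B scores significantly lower than variant A."""
    scores_a = rouge_l_scores(references, outputs_a)
    scores_b = rouge_l_scores(references, outputs_b)
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    regression = p_value < alpha and sum(scores_b) < sum(scores_a)
    return {"p_value": p_value, "regression_detected": regression}


if __name__ == "__main__":
    refs = ["the cat sat on the mat", "open the pod bay doors"]
    variant_a = ["a cat sat on the mat", "open the pod bay doors please"]
    variant_b = ["the feline rested", "doors opened"]
    print(compare_variants(refs, variant_a, variant_b))
```

An LLM-as-Judge layer follows a similar shape: a rubric prompt, a call to a judge model, and a parsed score. In the sketch below, `call_llm` is a hypothetical placeholder for whatever client the project uses (it simply takes a prompt string and returns the model's text); the rubric and JSON parsing are illustrative only.

```python
"""Sketch: LLM-as-Judge rubric scoring with a pluggable judge client."""
import json

JUDGE_PROMPT = """You are grading a model answer against a reference.
Return JSON: {{"score": 1-5, "reason": "<one sentence>"}}.

Question: {question}
Reference answer: {reference}
Model answer: {answer}"""


def judge(question: str, reference: str, answer: str, call_llm) -> dict:
    """Ask a judge model to rate semantic quality on a 1-5 rubric.

    `call_llm` is a placeholder: any callable that sends the prompt to a
    judge model and returns its raw text response.
    """
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer))
    return json.loads(raw)  # e.g. {"score": 4, "reason": "..."}
```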