About
This skill provides a structured framework for measuring and improving the quality of Large Language Model applications through rigorous testing methodologies. It equips developers with the tools to implement automated text metrics such as BLEU and BERTScore, establish human annotation workflows, and apply 'LLM-as-judge' patterns for semantic quality assessment. Whether you are comparing model versions, detecting performance regressions before deployment, or running A/B tests on prompts, this skill helps keep your AI systems reliable, factually accurate, and performant over time.
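
As a quick illustration of the automated-metric side of this workflow, the sketch below scores a candidate output against a reference with both BLEU and BERTScore. It assumes the `sacrebleu` and `bert-score` packages are installed; the example strings and variable names are purely illustrative.

```python
# Minimal sketch: automated text metrics for LLM output quality.
# Assumes `pip install sacrebleu bert-score`; strings below are placeholders.
import sacrebleu
from bert_score import score as bert_score

candidates = ["The model returned an accurate summary."]  # system outputs
references = ["The model produced an accurate summary."]  # gold references

# Corpus-level BLEU: surface n-gram overlap between candidates and references.
bleu = sacrebleu.corpus_bleu(candidates, [references])
print(f"BLEU: {bleu.score:.2f}")

# BERTScore: semantic similarity via contextual embeddings
# (downloads a pretrained model on first run).
P, R, F1 = bert_score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```

Surface metrics like BLEU are cheap regression signals, while embedding-based scores such as BERTScore catch paraphrases that n-gram overlap misses; combining them gives a broader picture before escalating to human review or LLM-as-judge evaluation.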