Overview
The LLM Performance Evaluation skill provides a framework for assessing the quality, reliability, and accuracy of Large Language Model outputs. It gives developers automated NLP metrics such as BLEU and BERTScore, 'LLM-as-judge' patterns for semantic grading, and structured human annotation workflows. Whether you are comparing model variants, validating prompt engineering changes, or setting up regression testing in a CI/CD pipeline, the skill's statistical analysis and multi-dimensional scoring help ensure your AI applications meet production-grade standards.
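As a quick illustration of the automated-metrics side, the sketch below scores model outputs against reference answers with BLEU and BERTScore. It assumes the evaluation is driven from Python with the `sacrebleu` and `bert-score` packages installed; the variable names are illustrative and not part of the skill's actual API.

```python
# Minimal sketch: scoring model outputs against references with BLEU and BERTScore.
# Assumes sacrebleu and bert-score are installed; names here are illustrative,
# not the skill's actual interface.
import sacrebleu
from bert_score import score as bert_score

candidates = ["The cat sat on the mat."]          # model outputs
references = ["A cat was sitting on the mat."]    # gold answers

# Corpus-level BLEU: surface n-gram overlap, fast but insensitive to paraphrase.
bleu = sacrebleu.corpus_bleu(candidates, [references])
print(f"BLEU: {bleu.score:.2f}")

# BERTScore: embedding-based similarity, more tolerant of rewording.
P, R, F1 = bert_score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```

In practice BLEU is useful for cheap regression checks on near-deterministic outputs, while BERTScore and LLM-as-judge grading handle paraphrased or open-ended responses.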