About
This skill provides a framework for measuring and improving the performance of Large Language Model (LLM) applications. It gives developers standardized tooling for automated metrics such as BLEU and ROUGE, semantic evaluation via BERTScore, and "LLM-as-judge" patterns for subjective quality assessment. By integrating regression detection and statistical A/B testing, it supports data-driven decisions in prompt engineering, model selection, and production monitoring, helping keep AI outputs accurate, safe, and helpful.
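To make two of the mentioned techniques concrete, here is a minimal pure-Python sketch, not the skill's actual tooling: a ROUGE-1 F1 score (clipped unigram overlap, the simplest member of the ROUGE family) and a two-proportion z-test of the kind used to compare win rates between two prompt or model variants in an A/B test. Both function names and signatures are illustrative assumptions.

```python
import math
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Illustrative ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clip each word's count by its count in the reference.
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def ab_z_score(wins_a: int, n_a: int, wins_b: int, n_b: int) -> float:
    """Two-proportion z-test statistic for comparing win rates of variants A and B.

    |z| > 1.96 indicates a difference significant at roughly the 5% level.
    """
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se
```

In practice the skill's automated metrics would come from established libraries rather than hand-rolled functions, but the sketch shows the underlying arithmetic: overlap-based scoring for text similarity and a significance test to decide whether one variant truly outperforms another.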