About
This skill provides a complete toolkit for systematically measuring and improving the performance of LLM-based applications. It covers essential evaluation methodologies, including automated n-gram and embedding-based metrics (BLEU, ROUGE, BERTScore), LLM-as-judge patterns for semantic assessment, and structured human annotation frameworks. Whether you are comparing model versions, detecting performance regressions in CI/CD, or validating prompt-engineering changes, it supplies the implementation patterns and statistical analysis tools needed to build production-grade confidence in AI systems.
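As a quick illustration of the automated-metric side, here is a minimal sketch that scores a candidate output against a reference with BLEU and ROUGE. The `nltk` and `rouge_score` packages are assumptions chosen for illustration; the skill's own tooling may differ.

```python
# Minimal sketch: automated n-gram metrics for one candidate/reference pair.
# Library choices (nltk, rouge_score) are illustrative, not the skill's API.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The quick brown fox jumps over the lazy dog."
candidate = "A quick brown fox jumped over the lazy dog."

# BLEU measures n-gram precision; smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU:       {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```

Surface metrics like these are cheap enough to run on every commit as a regression signal; the LLM-as-judge and human annotation patterns described above cover the semantic quality that n-gram overlap misses.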