About
The LLM Evaluation skill provides a comprehensive toolkit for assessing the performance, quality, and reliability of Large Language Model applications. It equips developers with the methodologies needed to move beyond 'vibes-based' testing: calculating automated metrics like BERTScore and ROUGE, implementing LLM-as-judge patterns, and conducting rigorous A/B tests. Whether you are validating prompt engineering changes, benchmarking different foundation models, or building a RAG pipeline, this skill supplies the statistical framework and code patterns required to establish production-grade quality baselines and detect performance regressions.
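As a rough illustration of the automated-metrics workflow, the sketch below scores a single model output against a reference answer with ROUGE and BERTScore. The `rouge-score` and `bert-score` packages and the example strings are assumptions chosen for this sketch, not necessarily what the skill uses internally.

```python
# Minimal sketch: comparing a model output to a reference answer.
# Library choices and example strings are illustrative assumptions.
from rouge_score import rouge_scorer        # pip install rouge-score
from bert_score import score as bert_score  # pip install bert-score

prediction = "The Eiffel Tower is located in Paris, France."
reference = "The Eiffel Tower stands in Paris."

# ROUGE: n-gram and longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, prediction)
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")

# BERTScore: embedding-based semantic similarity
# (downloads a pretrained model on first run).
P, R, F1 = bert_score([prediction], [reference], lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```

In practice, scores like these would be computed across a whole evaluation set and compared against an established baseline so that regressions can be flagged automatically.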