About
This skill provides a framework for validating LLM application outputs through a multi-layered evaluation approach. Developers can combine traditional reference-based metrics such as ROUGE and BERTScore with LLM-as-Judge patterns for nuanced, rubric-based scoring. It also includes specialized modules for hallucination detection and RAG grounding checks, supporting reliable benchmarks and continuous evaluation pipelines within CI/CD workflows.
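To illustrate the traditional-metric layer, here is a minimal, self-contained sketch of ROUGE-1 F1 (unigram overlap with whitespace tokenization and no stemming). The `rouge1_f1` name is hypothetical; in practice you would typically use a library such as `rouge-score` rather than this simplified version:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Each shared token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"), 4))
```

A benchmark harness would compute scores like this over a dataset of reference/output pairs and fail the CI run when the aggregate falls below a threshold.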