About
This skill provides a comprehensive methodology for assessing the quality, safety, and reliability of LLM systems. It enables developers to implement structured testing strategies ranging from simple unit evaluations and classification metrics to RAGAS scoring for retrieval-augmented pipelines. By layering evaluation patterns such as automated checks, LLM-as-judge scoring, and safety audits for hallucinations and bias, it helps ensure that AI applications meet production-grade quality and performance standards.
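
As a rough illustration of the layered pattern, the Python sketch below combines a cheap deterministic check with an LLM-as-judge rubric score. The `judge` callable, the `required_terms` parameter, and the 1-to-5 rubric are illustrative assumptions for this sketch, not APIs defined by the skill itself.

```python
# Minimal sketch of a layered evaluation, assuming a user-supplied `judge`
# callable that takes a prompt and returns a numeric score from 1 to 5.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class EvalResult:
    passed: bool
    score: Optional[float]
    reason: str

def layered_eval(
    question: str,
    answer: str,
    required_terms: list[str],
    judge: Callable[[str], float],  # hypothetical: prompt in, 1-5 score out
    threshold: float = 4.0,
) -> EvalResult:
    # Layer 1: automated check. Fail fast if mandatory terms are missing,
    # so the more expensive judge call is skipped.
    missing = [t for t in required_terms if t.lower() not in answer.lower()]
    if missing:
        return EvalResult(False, None, f"missing required terms: {missing}")

    # Layer 2: LLM-as-judge scoring against a simple faithfulness rubric.
    rubric = (
        "Rate the answer from 1 (unfaithful) to 5 (fully faithful) to the question.\n"
        f"Question: {question}\nAnswer: {answer}\nScore:"
    )
    score = judge(rubric)
    if score < threshold:
        return EvalResult(False, score, "judge score below threshold")
    return EvalResult(True, score, "ok")
```

In practice the `judge` callable would wrap a model call and parse the numeric score, and further layers (for example hallucination or bias audits with their own rubrics) can be appended in the same fashion.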