About
This skill equips Claude with advanced patterns for benchmarking LLM applications, moving beyond traditional software metrics to LLM-driven quality evaluation. It provides standardized workflows for Ragas, DeepEval, and LlamaIndex to measure faithfulness, detect hallucinations, and scale testing through synthetic dataset generation. By implementing 'LLM-as-a-judge' patterns, it helps developers verify that AI outputs remain accurate, professional, and grounded in the provided context across model upgrades and prompt iterations.
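The core pattern described above can be sketched as a small harness: each test case pairs a model answer with the context it should be grounded in, and a judge scores faithfulness. This is a minimal, self-contained illustration, not the Ragas or DeepEval API; the `overlap_judge` below is a deterministic word-overlap stand-in so the example runs without an LLM call, whereas a real judge would prompt an LLM (or use a library metric such as Ragas faithfulness) to verify each claim.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    question: str
    context: str   # retrieved passages the answer must be grounded in
    answer: str    # model output under test

def overlap_judge(case: EvalCase) -> float:
    """Toy judge: fraction of answer words that appear in the context.
    A real LLM-as-a-judge would verify individual claims instead."""
    ctx_words = set(case.context.lower().split())
    words = case.answer.lower().split()
    if not words:
        return 0.0
    return sum(w in ctx_words for w in words) / len(words)

def run_benchmark(cases: list[EvalCase],
                  judge: Callable[[EvalCase], float],
                  threshold: float = 0.7) -> dict:
    """Score every case and report the pass rate against a faithfulness threshold."""
    scores = [judge(c) for c in cases]
    return {
        "scores": scores,
        "pass_rate": sum(s >= threshold for s in scores) / len(cases),
    }

cases = [
    EvalCase("What is the capital of France?",
             "Paris is the capital of France.",
             "Paris is the capital of France."),        # grounded answer
    EvalCase("What is the capital of France?",
             "Paris is the capital of France.",
             "The capital is Lyon, founded in 43 BC."),  # hallucinated answer
]
report = run_benchmark(cases, overlap_judge)
```

Running the same harness before and after a model upgrade or prompt change, and comparing `pass_rate`, is the regression-testing workflow the skill standardizes; swapping `overlap_judge` for an LLM-backed judge turns the sketch into the real pattern.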