This skill equips Claude with advanced patterns for benchmarking LLM applications, moving beyond traditional software metrics to objective, LLM-driven evaluation. It provides standardized workflows for Ragas, DeepEval, and LlamaIndex to measure faithfulness, detect hallucinations, and scale testing through synthetic dataset generation. By implementing 'LLM-as-a-judge' patterns, it helps developers ensure their AI outputs remain accurate, professional, and grounded in provided context during model upgrades or prompt iterations.
Key Features
1. LLM-as-a-judge implementation using high-reasoning models for grading (sketch below)
2. Automated synthetic data generation for high-scale regression testing (sketch below)
3. RAG benchmarking with Faithfulness and Retrieval Relevance metrics (sketch below)
4. Unit testing patterns for LLM outputs using DeepEval and GEval (sketch below)
5. Regression testing workflows to prevent quality degradation during updates
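
At its core, the LLM-as-a-judge pattern is a rubric prompt plus a structured-output call to a high-reasoning model. The sketch below is a minimal illustration, not the skill's bundled implementation: it assumes the official `openai` client with an `OPENAI_API_KEY` set, and the model name and rubric wording are placeholders.

```python
# Minimal LLM-as-a-judge sketch: grade an answer's faithfulness to its context.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a strict grader. Score the ANSWER from 1 to 5 for
faithfulness to the CONTEXT (5 = fully grounded, 1 = contradicted or invented).
Return JSON: {{"score": <int>, "reason": "<one sentence>"}}

CONTEXT: {context}
ANSWER: {answer}"""

def judge(answer: str, context: str) -> dict:
    # Structured JSON output keeps the verdict machine-parseable for CI gates.
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any high-reasoning judge model works here
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

verdict = judge(
    answer="Refunds are accepted within 30 days of purchase.",
    context="Our policy allows refunds within 30 days of purchase.",
)
print(verdict["score"], verdict["reason"])
```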
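Synthetic data generation runs the same pattern in reverse: a model invents question/ground-truth pairs from document chunks, and the resulting cases become a regression suite. The hand-rolled sketch below is one way to do this under the same client assumptions; the chunking and prompt are illustrative, and libraries such as Ragas's test-set generator or DeepEval's synthesizer offer more structured pipelines.

```python
# Hedged sketch: synthesize question/ground-truth pairs from document chunks.
import json
from openai import OpenAI

client = OpenAI()

GENERATOR_PROMPT = """From the PASSAGE below, write one question a user might
ask and its correct answer, strictly grounded in the passage.
Return JSON: {{"question": "...", "ground_truth": "..."}}

PASSAGE: {chunk}"""

def synthesize(chunks: list[str]) -> list[dict]:
    cases = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-4o",  # assumption: generator model is interchangeable
            messages=[{"role": "user",
                       "content": GENERATOR_PROMPT.format(chunk=chunk)}],
            response_format={"type": "json_object"},
        )
        case = json.loads(response.choices[0].message.content)
        case["context"] = chunk  # keep provenance for later faithfulness checks
        cases.append(case)
    return cases

suite = synthesize(["Our policy allows refunds within 30 days of purchase."])
print(suite)
```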
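For the RAG metrics, a Ragas evaluation can be as small as one dataset and two metric imports. The sketch below uses the v0.1-style API (newer releases reorganized these imports around `EvaluationDataset`); the single row of question/answer/contexts data is placeholder, an `OPENAI_API_KEY` is assumed for the default judge and embedding models, and `answer_relevancy` stands in here for the feature list's "Retrieval Relevance".

```python
# Hedged Ragas sketch: score faithfulness and relevance on a tiny dataset.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

eval_data = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
})

# Each metric is scored 0-1 by an LLM judge; low faithfulness flags answers
# that are not grounded in the retrieved contexts (i.e. hallucinations).
result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(result)
```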
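Finally, the DeepEval/GEval unit-testing pattern also covers the regression workflow: rerun the same graded tests after every model upgrade or prompt change and fail CI on score drops. A hedged sketch follows, runnable under pytest with an `OPENAI_API_KEY` set; `generate_answer()` is a hypothetical stand-in for the application under test, and the criteria and threshold are illustrative.

```python
# Hedged DeepEval/GEval sketch: a pytest-style quality gate for LLM output.
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

professionalism = GEval(
    name="Professionalism",
    criteria=("The actual output answers the input accurately, in a "
              "professional tone, without fabricated details."),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,  # fail the test (and the CI run) below this judge score
)

def generate_answer(question: str) -> str:
    # Hypothetical stand-in for the application under test.
    return "Refunds are accepted within 30 days of purchase."

def test_refund_answer_quality():
    case = LLMTestCase(
        input="What is the refund window?",
        actual_output=generate_answer("What is the refund window?"),
    )
    assert_test(case, [professionalism])
```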