Introduction
The llm-evaluation skill empowers developers to move beyond subjective 'vibe checks' and establish rigorous, data-driven testing for AI applications. It provides standardized implementations for automated NLP metrics (BLEU, ROUGE, BERTScore), LLM-as-judge patterns, and human annotation workflows. Whether you are validating RAG retrieval accuracy, comparing model performance via A/B testing, or monitoring for regressions during deployment, this skill offers the necessary tools to measure accuracy, coherence, and safety systematically across the entire development lifecycle.
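As a point of reference for what the automated-metric layer looks like in practice, the sketch below scores a batch of model outputs with BLEU and ROUGE. It assumes the Hugging Face `evaluate` package rather than this skill's own API, so treat it as an illustrative example, not the skill's interface.

```python
# Illustrative sketch: scoring model outputs against references with standard
# automated metrics. Assumes the Hugging Face `evaluate` package
# (`pip install evaluate rouge_score`), not this skill's own API.
import evaluate

predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# ROUGE: n-gram and longest-common-subsequence overlap, common for summarization-style tasks.
rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=predictions, references=references)

# BLEU: n-gram precision with a brevity penalty, common for translation-style tasks.
# BLEU accepts multiple references per prediction, hence the nested list.
bleu = evaluate.load("bleu")
bleu_scores = bleu.compute(predictions=predictions, references=[[r] for r in references])

print(rouge_scores)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
print(bleu_scores)   # e.g. {'bleu': ..., 'precisions': [...], 'brevity_penalty': ...}
```

Metrics like these are cheap and reproducible, which makes them a good first gate before more expensive LLM-as-judge or human annotation passes described later in this document.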