Can this skill help evaluate RAG applications?

Yes. It includes metrics specific to Retrieval-Augmented Generation (RAG), such as Mean Reciprocal Rank (MRR) for retrieval quality and groundedness checks that verify answers are supported by the provided context.
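As a rough illustration of these two ideas, here is a minimal sketch in plain Python. The function names and the token-overlap heuristic are assumptions made for this example and are not taken from the skill itself; real groundedness checks typically rely on an entailment model or an LLM judge rather than word overlap.

```python
import re
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the reciprocal rank over (ranked results, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def token_overlap_groundedness(answer: str, context: str) -> float:
    """Crude proxy (an assumption): share of answer tokens that also occur in the context."""
    answer_tokens = re.findall(r"\w+", answer.lower())
    context_tokens = set(re.findall(r"\w+", context.lower()))
    if not answer_tokens:
        return 0.0
    hits = sum(1 for token in answer_tokens if token in context_tokens)
    return hits / len(answer_tokens)


# Hypothetical retrieval results for three queries.
example_runs = [
    (["d3", "d1", "d2"], {"d1"}),  # first relevant hit at rank 2
    (["d5", "d4"], {"d5"}),        # first relevant hit at rank 1
    (["d7", "d8"], {"d9"}),        # no relevant document retrieved
]
example_mrr = mean_reciprocal_rank(example_runs)
```

For the three hypothetical queries above, the reciprocal ranks are 0.5, 1.0, and 0.0, so `example_mrr` is 0.5.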

What is the 'LLM-as-Judge' approach?

LLM-as-Judge is an evaluation pattern where a more capable model, like Claude 3.5 Sonnet, is used to grade the outputs of another model based on specific rubrics like accuracy, helpfulness, or tone.
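The pattern can be sketched without committing to any particular provider. In the example below, `call_judge` is a hypothetical callable that sends a prompt to whichever judge model you use and returns its text reply; the prompt wording and the `SCORE:` reply format are likewise assumptions for illustration, not the skill's actual rubric.

```python
import re
from typing import Callable

# Hypothetical rubric prompt; adapt the wording and scale to your own criteria.
RUBRIC_PROMPT = (
    "You are grading a model response.\n"
    "Criterion: {criterion}\n"
    "Question: {question}\n"
    "Response: {response}\n"
    'Reply with one line of the form "SCORE: <1-5>", then a brief justification.'
)


def judge_response(
    call_judge: Callable[[str], str],
    question: str,
    response: str,
    criterion: str,
) -> int:
    """Ask the judge model for a 1-5 score and parse it from the reply."""
    prompt = RUBRIC_PROMPT.format(
        criterion=criterion, question=question, response=response
    )
    reply = call_judge(prompt)
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        raise ValueError(f"Judge reply did not contain a score: {reply!r}")
    return int(match.group(1))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "SCORE: 4\nMostly accurate, with one minor omission."


example_score = judge_response(
    fake_judge,
    question="What is the capital of France?",
    response="Paris.",
    criterion="accuracy",
)
```

Passing the judge in as a callable keeps the parsing logic testable with a stub and lets you swap judge models without touching the scoring code.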

Is statistical validation included for A/B testing?

Yes. The skill provides a statistical framework that uses t-tests to determine whether the performance difference between two model variants is statistically significant, and Cohen's d to measure how large that difference is.
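To make the two quantities concrete, here is a minimal standard-library sketch. It assumes per-example scores for each variant and uses Welch's unequal-variance form of the t statistic; the skill may use a different variant. It computes only the statistic, so obtaining a p-value would need a t-distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
import statistics
from typing import Sequence


def welch_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for the difference in means (a minus b)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d: mean difference (a minus b) divided by the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_variance = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_variance)


# Hypothetical per-example scores for two prompt variants.
variant_a = [2.0, 3.0, 4.0]
variant_b = [1.0, 2.0, 3.0]
t_value = welch_t_statistic(variant_a, variant_b)
effect_size = cohens_d(variant_a, variant_b)
```

Reporting both numbers matters because a large sample can make a trivial difference significant, while a small sample can hide a large one.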

How does regression testing work for LLMs?

The skill includes a RegressionDetector that compares current evaluation scores against a historical baseline. It automatically flags significant degradations in metrics such as accuracy or toxicity.
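The comparison logic might look roughly like the sketch below. `SimpleRegressionDetector`, its fields, and the fixed absolute tolerance are all hypothetical and do not reflect the skill's actual RegressionDetector API. The sketch also assumes that some metrics, such as toxicity, are better when lower.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass
class SimpleRegressionDetector:
    """Hypothetical detector: flags metrics that got worse by more than `tolerance`."""

    baseline: Dict[str, float]
    tolerance: float = 0.05
    lower_is_better: FrozenSet[str] = frozenset({"toxicity"})

    def check(self, current: Dict[str, float]) -> List[str]:
        """Return the names of metrics that regressed relative to the baseline."""
        flagged = []
        for metric, base_value in self.baseline.items():
            if metric not in current:
                continue  # nothing to compare for this metric
            change = current[metric] - base_value
            if metric in self.lower_is_better:
                change = -change  # a rise in toxicity counts as a drop in quality
            if change < -self.tolerance:
                flagged.append(metric)
        return flagged


detector = SimpleRegressionDetector(
    baseline={"accuracy": 0.90, "toxicity": 0.02}, tolerance=0.03
)
regressions = detector.check({"accuracy": 0.85, "toxicity": 0.08})
```

With these made-up numbers both metrics are flagged: accuracy fell by 0.05 and toxicity rose by 0.06, each exceeding the 0.03 tolerance.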

LLM Evaluation & Benchmarking

Author: HermeticOrmus


Data Science and ML

Implements systematic evaluation strategies for LLM applications using automated metrics, human feedback loops, and benchmarking frameworks.

The llm-evaluation skill empowers developers to move beyond subjective 'vibe checks' and establish rigorous, data-driven testing for AI applications. It provides standardized implementations for automated NLP metrics (BLEU, ROUGE, BERTScore), LLM-as-judge patterns, and human annotation workflows. Whether you are validating RAG retrieval accuracy, comparing model performance via A/B testing, or monitoring for regressions during deployment, this skill offers the necessary tools to measure accuracy, coherence, and safety systematically across the entire development lifecycle.

Key Features

01 Statistical A/B testing framework with Cohen's d effect size analysis

02 RAG-specific evaluation metrics like MRR, NDCG, and groundedness checks

03 Automated NLP metrics including BLEU, ROUGE, and BERTScore

04 LLM-as-Judge patterns for qualitative and pairwise scoring

05 Automated regression detection to prevent performance degradation
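Of the RAG metrics named in the list above, NDCG is the one that accounts for graded relevance. A minimal sketch follows, assuming the common log2-discount formulation; the function names are illustrative and not taken from the skill.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain: relevance at position i is divided by log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ranked = list(relevances) if k is None else list(relevances)[:k]
    ideal = sorted(relevances, reverse=True)[: len(ranked)]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg(ranked) / ideal_dcg


# Hypothetical graded relevance labels, listed in the order the retriever returned them.
perfect_ranking = ndcg([3, 2, 1])
swapped_ranking = ndcg([1, 3, 2])
```

A ranking already sorted by relevance scores 1.0, and any ranking that places a less relevant document above a more relevant one scores lower.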

Use Cases

01 Validating prompt engineering improvements with statistical significance

02 Comparing performance differences when switching between LLM providers

03 Benchmarking RAG systems for retrieval relevance and factual accuracy


Install with 🐟 Skill.Fish

npx skillfish add hermeticormus/floreserlife llm-evaluation

For use in Claude.ai and ChatGPT
