Overview
The LLM Evaluation skill provides a rigorous, standardized approach to measuring the performance and quality of Large Language Model applications. It equips developers with the tools needed to move beyond anecdotal testing by implementing automated NLP metrics such as BLEU and BERTScore, RAG-specific retrieval analysis, and LLM-as-judge patterns for qualitative assessment. Whether you are comparing model versions, validating prompt engineering changes, or setting up production regression tests, this skill grounds your AI features in data-driven benchmarks and statistically meaningful comparisons.
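As a minimal illustration of the automated-metrics workflow described above, the sketch below scores placeholder model outputs against reference texts using the sacrebleu and bert-score packages. These libraries and the sample sentences are assumptions for illustration, not prescribed dependencies of the skill.

```python
# Minimal sketch of automated metric scoring, assuming the `sacrebleu` and
# `bert-score` packages are installed (pip install sacrebleu bert-score).
# The candidate and reference sentences are placeholder data.
import sacrebleu
from bert_score import score as bert_score

candidates = [
    "The cat sat on the mat.",
    "Paris is the capital of France.",
]
references = [
    "A cat was sitting on the mat.",
    "The capital of France is Paris.",
]

# Corpus-level BLEU: n-gram overlap between candidates and references.
bleu = sacrebleu.corpus_bleu(candidates, [references])
print(f"BLEU: {bleu.score:.2f}")

# BERTScore: embedding-based semantic similarity; returns per-sentence
# precision, recall, and F1 tensors.
precision, recall, f1 = bert_score(candidates, references, lang="en")
print(f"BERTScore F1 (mean): {f1.mean().item():.4f}")
```

In practice, corpus-level scores like these are paired with a significance test (for example, paired bootstrap resampling) before concluding that one model version outperforms another.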