Overview
The llm-evaluation skill provides a robust toolkit for measuring and optimizing the performance of AI-driven applications. It enables developers to implement a wide array of assessment strategies, ranging from traditional NLP metrics like BLEU and ROUGE to embedding-based metrics like BERTScore. Beyond simple metrics, it facilitates more complex evaluation patterns such as LLM-as-judge for qualitative analysis, statistical A/B testing for model comparison, and specialized RAG evaluation to verify retrieval accuracy. This skill is essential for developers looking to move beyond vibes-based development to a data-driven approach that ensures production readiness and prevents regressions.
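As a rough illustration of the reference-based metrics mentioned above (not the skill's own API, which is documented in later sections), the following minimal sketch scores a candidate answer against a reference using the commonly available `sacrebleu` and `rouge_score` packages:

```python
# Illustrative sketch only: demonstrates BLEU and ROUGE scoring with third-party
# packages; the llm-evaluation skill wraps metrics like these behind its own API.
import sacrebleu
from rouge_score import rouge_scorer

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# Corpus-level BLEU: sacrebleu takes a list of hypotheses and a list of
# reference lists.
bleu = sacrebleu.corpus_bleu([candidate], [[reference]])
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-1 and ROUGE-L F-measures for the same pair.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```

In practice these lexical metrics are typically combined with embedding-based scores (e.g. BERTScore) and LLM-as-judge rubrics, since surface overlap alone often underestimates the quality of paraphrased but correct outputs.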