About
The LLM Evaluation skill provides a robust framework for measuring and improving the performance of AI applications. It enables developers to implement systematic testing across multiple dimensions, including automated text metrics like BERTScore and ROUGE, LLM-as-judge patterns for qualitative assessment, and rigorous statistical A/B testing. By providing standardized tools for regression detection and RAG-specific retrieval metrics, this skill helps teams build confidence in their production AI systems and validate improvements from prompt engineering or model fine-tuning.
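As a concrete illustration of the automated text metrics mentioned above, here is a minimal sketch of computing ROUGE and BERTScore for a single candidate/reference pair. It assumes the third-party rouge-score and bert-score packages are installed; the package names, example strings, and calling style are illustrative assumptions, not this skill's actual interface.

```python
# Minimal sketch of automated text metrics (ROUGE, BERTScore).
# Assumes `pip install rouge-score bert-score`; the skill's own API may differ.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

candidate = "The cat sat on the mat."          # hypothetical model output
reference = "A cat was sitting on the mat."    # hypothetical ground truth

# Lexical overlap: ROUGE-1 and ROUGE-L F-measures.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

# Semantic similarity: BERTScore F1 (downloads a model on first use).
_, _, f1 = bert_score([candidate], [reference], lang="en")
print("BERTScore F1:", round(f1.item(), 3))
```

In practice, metrics like these are run over a held-out evaluation set and tracked across prompt or model changes, which is where the regression-detection and A/B-testing pieces of the skill come in.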