Introduction
This skill provides a robust toolkit for systematically assessing the quality and performance of Large Language Model applications throughout the development lifecycle. It covers the entire evaluation spectrum, from standard NLP metrics like BLEU and BERTScore, through "LLM-as-judge" patterns for semantic assessment, to human annotation workflows. Designed for developers building production-grade AI, it includes statistical frameworks for A/B testing, regression detection to catch performance drops before release, and metrics specific to Retrieval-Augmented Generation (RAG) systems, enabling data-driven decisions instead of anecdotal spot-checks of model responses.
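To make the two ends of that spectrum concrete, below is a minimal sketch of combining a reference-based metric with an LLM-as-judge score. It assumes the `nltk` package is available for BLEU; the judge is passed in as a generic callable because the actual model client this skill wires up is not specified here, so `judge` and the example strings are illustrative assumptions, not the skill's own API.

```python
"""Minimal sketch: reference-based BLEU plus an LLM-as-judge hook."""
from typing import Callable

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def bleu_score(candidate: str, reference: str) -> float:
    """Sentence-level BLEU with smoothing so short outputs don't score zero."""
    smooth = SmoothingFunction().method1
    return sentence_bleu(
        [reference.split()],          # list of tokenized references
        candidate.split(),            # tokenized candidate
        smoothing_function=smooth,
    )


def judge_score(candidate: str, reference: str,
                judge: Callable[[str], str]) -> float:
    """LLM-as-judge pattern: ask a model to grade semantic fidelity 1-5.

    `judge` is any function that takes a prompt string and returns the
    model's raw text reply (hypothetical stand-in for a real LLM client).
    """
    prompt = (
        "Rate how faithfully the CANDIDATE answer conveys the REFERENCE "
        "answer on a scale of 1 to 5. Reply with the number only.\n"
        f"REFERENCE: {reference}\nCANDIDATE: {candidate}"
    )
    raw = judge(prompt)
    return float(raw.strip()) / 5.0   # normalize to the 0-1 range


if __name__ == "__main__":
    cand = "The capital of France is Paris."
    ref = "Paris is the capital of France."
    print(f"BLEU: {bleu_score(cand, ref):.3f}")
    # A trivial judge stub, used only so the example runs end to end:
    print(f"Judge: {judge_score(cand, ref, judge=lambda _prompt: '5'):.2f}")
```

In practice the judge callable would wrap whatever model endpoint the application already uses, and both scores would be aggregated over an evaluation set rather than a single pair, which is where the A/B testing and regression-detection pieces of the skill come in.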