Introduction
This skill gives developers a toolkit for systematically measuring and improving LLM application quality. It bridges the gap between raw model outputs and production-ready reliability with standard metrics such as BERTScore, ROUGE, and BLEU, alongside LLM-as-Judge patterns and statistical testing. Whether you are detecting performance regressions, validating prompt changes, or benchmarking different models, the skill provides the patterns needed to establish baselines and maintain high standards for AI-driven features.
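
As a quick illustration of the reference-based metrics named above, the minimal sketch below scores one candidate answer against a reference. It assumes the `rouge-score` and `nltk` packages are installed; the `score_output` helper and the example strings are hypothetical and only show how such metrics might be wired together, not the skill's actual API.

```python
# Minimal sketch: score one model output against a reference answer.
# Assumes the `rouge-score` and `nltk` packages are available; names and
# example strings below are illustrative, not part of the skill itself.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def score_output(reference: str, candidate: str) -> dict:
    """Compute ROUGE-L and sentence-level BLEU for a (reference, candidate) pair."""
    # ROUGE-L F-measure via Google's rouge-score package.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

    # Sentence-level BLEU with smoothing, since short outputs often have
    # zero higher-order n-gram overlap.
    bleu = sentence_bleu(
        [reference.split()],
        candidate.split(),
        smoothing_function=SmoothingFunction().method1,
    )
    return {"rougeL_f": rouge_l, "bleu": bleu}


if __name__ == "__main__":
    ref = "The capital of France is Paris."
    cand = "Paris is the capital of France."
    print(score_output(ref, cand))
```

In practice, per-example scores like these are aggregated over an evaluation set and compared against a stored baseline to flag regressions.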