Introduction
This skill helps developers systematically measure and improve LLM performance through a layered evaluation approach. It provides standardized implementations of automated metrics such as BLEU and BERTScore, LLM-as-judge patterns for assessing semantic quality, and statistical frameworks for A/B testing and regression detection. Together, these let teams move beyond anecdotal testing to data-driven model selection, prompt engineering, and production monitoring, catching performance regressions before they reach users.
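As a minimal sketch of how two of these layers might fit together, the snippet below scores candidate outputs against references with sentence-level BLEU (via `nltk`) and then uses a paired bootstrap test to estimate whether one system's improvement over another is statistically meaningful. The helper names, the toy data, and the bootstrap parameters are illustrative assumptions, not the skill's actual API:

```python
import random
from statistics import mean

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def bleu(candidate: str, reference: str) -> float:
    """Sentence-level BLEU with smoothing (short outputs score 0 without it)."""
    smooth = SmoothingFunction().method1
    return sentence_bleu(
        [reference.split()], candidate.split(), smoothing_function=smooth
    )


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0) -> float:
    """Fraction of resamples in which system B beats system A on mean score."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    wins = 0
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        if mean(sample) > 0:
            wins += 1
    return wins / n_resamples


# Hypothetical eval set: references plus outputs from two prompt/model variants.
refs    = ["the cat sat on the mat", "paris is the capital of france"]
model_a = ["a cat sat on a mat",     "paris is france's capital"]
model_b = ["the cat sat on the mat", "the capital of france is paris"]

scores_a = [bleu(c, r) for c, r in zip(model_a, refs)]
scores_b = [bleu(c, r) for c, r in zip(model_b, refs)]
print(
    f"mean BLEU A={mean(scores_a):.3f} B={mean(scores_b):.3f}, "
    f"P(B>A)={paired_bootstrap(scores_a, scores_b):.2f}"
)
```

The same harness shape generalizes to the other layers: swap the `bleu` scorer for a BERTScore call or an LLM-as-judge rubric, and the bootstrap comparison works unchanged on any per-example score.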