About
This skill provides a comprehensive toolkit for measuring and improving the quality of AI-driven applications. It covers everything from standard NLP metrics like BLEU and ROUGE to advanced 'LLM-as-Judge' methodologies and statistical A/B testing frameworks. By establishing systematic evaluation baselines, developers can confidently detect performance regressions, compare different model versions, and validate prompt engineering improvements throughout the software development lifecycle.
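As a minimal illustration of the ideas above, the sketch below pairs a simplified ROUGE-1 recall scorer with a two-proportion z-test for comparing the pass rates of two model versions on the same evaluation set. This is a hedged sketch, not this skill's actual API: the function names (`rouge1_recall`, `two_proportion_z_test`) and the sample counts are illustrative assumptions.

```python
import math
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams recovered by the candidate (clipped counts)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram overlap
    return overlap / sum(ref.values())

def two_proportion_z_test(pass_a: int, n_a: int, pass_b: int, n_b: int):
    """Two-sided z-test on the difference between two observed pass rates."""
    p_pool = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (pass_b / n_b - pass_a / n_a) / se
    # Two-sided p-value from the standard normal CDF (via math.erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Illustrative comparison: 100 eval cases per variant, B passes more often than A.
z, p = two_proportion_z_test(78, 100, 90, 100)
```

The z-test is a large-sample approximation; with small evaluation sets (or pass counts near 0 or n), an exact test such as Fisher's is the safer choice.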