Overview
This skill provides a robust framework for measuring and optimizing the quality of AI-driven applications. It equips developers with a suite of tools ranging from traditional NLP metrics like BLEU and ROUGE to modern 'LLM-as-judge' patterns and statistical A/B testing. Whether you are validating RAG performance, comparing model providers, or setting up automated regression testing in a CI/CD pipeline, this skill offers the implementation patterns needed to move from anecdotal prompt testing to data-driven performance engineering.
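To give a flavor of the metric end of that range, below is a minimal sketch of a ROUGE-1 F1 calculation in plain Python. It assumes simple whitespace tokenization and is purely illustrative of the kind of reference-based scoring the skill wraps; the skill's own tooling may rely on a dedicated metrics library instead.

```python
# Illustrative sketch only: ROUGE-1 F1 (unigram overlap) between a candidate
# and a reference string, assuming lowercase whitespace tokenization.
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Return the unigram-overlap F1 score between candidate and reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Overlap counts each shared unigram, bounded by its frequency on each side.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Example: imperfect but overlapping candidate scores well below 1.0.
    print(rouge1_f1("the cat sat on the mat", "a cat sat on a mat"))
```

The same scoring pattern (candidate vs. reference, aggregated over a test set) underlies the heavier-weight options the skill covers, from BLEU and ROUGE variants to LLM-as-judge rubrics.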