Overview
The LLM Performance Evaluation skill provides a framework for assessing the quality, reliability, and accuracy of Large Language Model outputs. It gives developers automated NLP metrics such as BLEU and BERTScore, 'LLM-as-judge' patterns for semantic grading, and structured human annotation workflows. Whether you are comparing model variants, validating prompt engineering changes, or setting up regression testing in a CI/CD pipeline, the skill's statistical analysis and multi-dimensional scoring help ensure your AI applications meet production-grade standards.
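As a quick illustration of the automated-metrics side, the sketch below scores model outputs against reference answers with BLEU and BERTScore. It assumes the evaluation is driven from Python with the `sacrebleu` and `bert-score` packages installed; the variable names are illustrative and not part of the skill's actual API.

```python
# Minimal sketch: scoring model outputs against references with BLEU and BERTScore.
# Assumes sacrebleu and bert-score are installed; names here are illustrative,
# not the skill's actual interface.
import sacrebleu
from bert_score import score as bert_score

candidates = ["The cat sat on the mat."]          # model outputs
references = ["A cat was sitting on the mat."]    # gold answers

# Corpus-level BLEU: surface n-gram overlap, fast but insensitive to paraphrase.
bleu = sacrebleu.corpus_bleu(candidates, [references])
print(f"BLEU: {bleu.score:.2f}")

# BERTScore: embedding-based similarity, more tolerant of rewording.
P, R, F1 = bert_score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```

In practice BLEU is useful for cheap regression checks on near-deterministic outputs, while BERTScore and LLM-as-judge grading handle paraphrased or open-ended responses.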