What metrics are supported for text generation evaluation?

The skill supports n-gram overlap metrics such as BLEU, ROUGE, and METEOR, as well as embedding-based BERTScore for measuring semantic similarity.
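To make the idea of an overlap metric concrete, here is a minimal sketch of a ROUGE-1 style unigram F1 score. It is not the skill's own code: the function name `unigram_overlap_f1` is hypothetical, and whitespace tokenization with lowercasing is a simplifying assumption (real ROUGE implementations usually add stemming and other normalization).

```python
from collections import Counter


def unigram_overlap_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 style F1 over whitespace tokens (illustrative only)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count of each shared token,
    # so a repeated word is only credited as often as the reference has it.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A candidate that reproduces only half of the reference tokens gets perfect precision but a recall of 0.5, which is why the F1 combination is usually reported rather than either number alone.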

Does it support statistical significance in A/B testing?

Yes, the skill includes a statistical testing framework that uses t-tests to check whether a performance difference between model versions is statistically significant, and Cohen’s d to report how large that difference is.
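As a rough illustration of the two quantities involved, the sketch below computes Cohen's d (pooled standard deviation) and Welch's t statistic using only the standard library. The function names are hypothetical and this is not the skill's implementation. It deliberately stops at the t statistic: turning it into a p-value needs the t distribution, for which a library routine such as `scipy.stats.ttest_ind` is the usual choice.

```python
import math
from statistics import mean, variance


def cohens_d(a: list[float], b: list[float]) -> float:
    """Effect size: mean difference in units of the pooled standard deviation.

    Assumes each sample has at least two values and non-zero variance.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic, which does not assume equal variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error
```

Reporting both matters: with enough samples a tiny difference can be statistically significant, while Cohen's d shows whether it is large enough to care about.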

Can I use this for RAG (Retrieval-Augmented Generation) applications?

Yes, it includes specialized retrieval metrics such as Mean Reciprocal Rank (MRR), NDCG, and Precision@K, along with custom metrics for groundedness.
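For readers unfamiliar with these retrieval metrics, the following is a minimal sketch of their standard definitions. The function names and the input shapes (ranked lists of document IDs, a set of relevant IDs, a dict of graded gains) are assumptions made for illustration, not the skill's API.

```python
import math


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k


def ndcg_at_k(ranked_ids: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering.

    `gains` maps document ID to a graded relevance; missing IDs count as 0.
    """
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(i + 2)
        for i, doc_id in enumerate(ranked_ids[:k])
    )
    ideal_gains = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0
```

MRR only looks at the first relevant hit, Precision@K ignores order within the top K, and NDCG rewards placing highly relevant documents early, so the three tend to be tracked together.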

How does the 'LLM-as-judge' feature work?

It provides implementation patterns for using a strong reasoning model to evaluate the outputs of other models through pointwise scoring, pairwise comparisons, or reference-based judging.
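The pointwise and pairwise patterns can be sketched as below. Everything here is an assumption for illustration: `Judge` is a hypothetical callable that takes a prompt and returns the judge model's raw reply, the prompt wording is invented, and the skill's actual patterns may differ. Reference-based judging is not shown.

```python
import re
from typing import Callable, Optional

# Hypothetical interface: prompt text in, raw reply from the judge model out.
Judge = Callable[[str], str]


def pointwise_score(judge: Judge, question: str, answer: str, scale: int = 5) -> Optional[int]:
    """Ask the judge for a 1..scale rating; return None if the reply is unusable."""
    prompt = (
        f"Rate the answer on a scale of 1 to {scale}. Reply with the number only.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    match = re.search(r"\d+", judge(prompt))
    if match is None:
        return None
    score = int(match.group())
    return score if 1 <= score <= scale else None


def pairwise_winner(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Compare two answers in both orders to reduce position bias.

    Returns "A" or "B" only when both orderings agree, otherwise "tie".
    """
    def ask(first: str, second: str) -> str:
        prompt = (
            "Which answer is better? Reply with '1' or '2' only.\n"
            f"Question: {question}\nAnswer 1: {first}\nAnswer 2: {second}"
        )
        return judge(prompt).strip()

    forward = ask(answer_a, answer_b)   # A shown in position 1
    backward = ask(answer_b, answer_a)  # A shown in position 2
    if forward == "1" and backward == "2":
        return "A"
    if forward == "2" and backward == "1":
        return "B"
    return "tie"
```

Passing the judge in as a plain callable keeps the sketch independent of any provider SDK and makes it easy to substitute a stub in tests. Asking in both orders is one common way to guard against a judge that favors whichever answer appears first.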

LLM Performance Evaluation

Author: amurata


Data Science & Machine Learning

Implements comprehensive evaluation frameworks for Large Language Model applications using automated metrics, human feedback, and LLM-as-judge patterns.

This skill provides developers with a robust toolkit for measuring and improving LLM application quality throughout the development lifecycle. It covers a wide spectrum of evaluation techniques, including linguistic metrics like BLEU and ROUGE, semantic similarity via BERTScore, RAG-specific retrieval metrics, and sophisticated LLM-as-judge patterns for qualitative assessment. By integrating systematic testing, A/B comparison, and regression detection, it helps teams build confidence in production AI systems, validate prompt engineering improvements, and maintain rigorous performance standards over time.

Key Features

01 Regression detection to prevent performance drops during model updates

02 3 GitHub stars

03 LLM-as-judge patterns for automated qualitative scoring and comparisons

04 Statistical A/B testing framework with significance and effect size analysis

05 RAG performance tracking for retrieval accuracy and groundedness

06 Automated NLP metrics including BLEU, ROUGE, and BERTScore
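As one possible reading of the regression detection feature, the sketch below compares a candidate model's metric scores against a stored baseline and flags any metric that dropped by more than a tolerance. The function name, the dict-of-scores input, and the default tolerance of 0.02 are all assumptions, and it treats every metric as higher-is-better.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return {metric: drop} for metrics that fell by more than `tolerance`.

    Assumes higher scores are better; metrics missing from `candidate` are skipped.
    """
    regressions = {}
    for metric, base_value in baseline.items():
        new_value = candidate.get(metric)
        if new_value is None:
            continue
        drop = base_value - new_value
        if drop > tolerance:
            # Round to keep floating-point noise out of the report.
            regressions[metric] = round(drop, 6)
    return regressions
```

A check like this could run in CI after each model or prompt update, failing the build when the returned dict is non-empty.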

Use Cases

01 Comparing performance and costs across different LLM providers and versions

02 Validating prompt engineering changes and system instructions before deployment

03 Measuring the accuracy and retrieval quality of RAG-based knowledge systems


Install with 🐟 Skill.Fish

npx skillfish add amurata/cc-tools llm-evaluation
