Can I use this for RAG (Retrieval-Augmented Generation) applications?

Yes. It includes retrieval metrics (MRR, NDCG) for scoring ranked results, plus custom metrics that check whether an answer is grounded in the provided context.
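As a rough illustration of the two retrieval metrics named above, here is a minimal sketch using their standard textbook definitions. The function names and input shapes are assumptions made for this example, not the skill's own API.

```python
import math
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """Average of 1/rank of the first relevant result per query (0 if none)."""
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        # flags[i] is truthy when the result at rank i+1 is relevant.
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    def dcg(values: Sequence[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

For example, `mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])` averages 1/2 and 1/1, and `ndcg_at_k([3, 2, 1], 3)` scores an already ideal ranking.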

What is the 'LLM-as-Judge' feature?

It is a pattern that uses a more capable model (like GPT-4o or Claude 3.5 Sonnet) to evaluate the outputs of other models based on criteria like accuracy, helpfulness, and clarity.
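A pointwise version of this pattern might look like the sketch below. Everything here is an assumption for illustration: the prompt wording, the 1 to 5 scale, the JSON reply format, and `call_judge`, a hypothetical callable that sends a prompt to whichever judge model you use and returns its text reply.

```python
import json
from typing import Callable, Dict

# Criteria taken from the description above; the 1-5 scale is an assumption.
CRITERIA = ("accuracy", "helpfulness", "clarity")


def build_judge_prompt(question: str, answer: str) -> str:
    """Compose a grading prompt that asks for a JSON-only reply."""
    return (
        "You are grading an answer. Score each criterion from 1 to 5.\n"
        f"Criteria: {', '.join(CRITERIA)}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        'Reply with JSON only, e.g. {"accuracy": 4, "helpfulness": 3, "clarity": 5}.'
    )


def judge_answer(
    question: str,
    answer: str,
    call_judge: Callable[[str], str],  # hypothetical: prompt in, reply text out
) -> Dict[str, int]:
    """Ask the judge model for scores and validate its reply."""
    raw = call_judge(build_judge_prompt(question, answer))
    scores = json.loads(raw)
    result: Dict[str, int] = {}
    for name in CRITERIA:
        value = int(scores[name])
        if not 1 <= value <= 5:
            raise ValueError(f"{name} score out of range: {value}")
        result[name] = value
    return result
```

Passing the model call in as a function keeps the sketch independent of any particular provider SDK and makes it easy to test with a stub.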

Is statistical significance testing included?

The skill provides an A/B testing framework that calculates p-values and Cohen's d to determine if model improvements are statistically meaningful.
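To make the two statistics concrete, the sketch below computes Cohen's d with a pooled standard deviation and estimates a two-sided p-value with a permutation test. The choice of a permutation test is an assumption made to keep the example dependency-free; the skill may use a different test, and the function names are hypothetical.

```python
import random
import statistics
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardized mean difference (b minus a) using the pooled variance.

    Requires at least two scores per group and nonzero pooled variance.
    """
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(b) - statistics.mean(a)) / pooled_var ** 0.5


def permutation_p_value(
    a: Sequence[float], b: Sequence[float], n_resamples: int = 2000, seed: int = 0
) -> float:
    """Two-sided p-value: how often a random relabeling of the scores gives
    a mean difference at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(b) - statistics.mean(a))
    pooled = list(a) + list(b)
    split = len(a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[split:]) - statistics.mean(pooled[:split]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)
```

Reporting both numbers matters: the p-value indicates whether a difference is likely to be noise, while Cohen's d indicates how large the difference is relative to the spread of the scores.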

Does it help with regression testing?

Yes, it includes a RegressionDetector class that compares new evaluation results against historical baselines to flag significant drops in quality.
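The interface of the skill's `RegressionDetector` is not shown here, so the sketch below is a stand-in under stated assumptions: a hypothetical `SketchRegressionDetector` that stores baseline metric values, treats higher as better, and flags any metric whose relative drop exceeds a threshold.

```python
from typing import Dict, List


class SketchRegressionDetector:
    """Hypothetical stand-in, not the skill's actual class.

    Assumes every metric is "higher is better" and compares each new value
    to a stored baseline using a relative-drop threshold.
    """

    def __init__(self, baseline: Dict[str, float], max_relative_drop: float = 0.05):
        self.baseline = baseline
        self.max_relative_drop = max_relative_drop

    def check(self, current: Dict[str, float]) -> List[str]:
        """Return the names of metrics that dropped more than the threshold."""
        regressions = []
        for metric, base in self.baseline.items():
            # Skip metrics missing from the new run or with an unusable baseline.
            if metric not in current or base <= 0:
                continue
            drop = (base - current[metric]) / base
            if drop > self.max_relative_drop:
                regressions.append(metric)
        return regressions
```

In a CI job, a non-empty result from `check` could be used to fail the build before a prompt or model change is deployed.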

What metrics does this skill support?

It supports standard text metrics like BLEU and ROUGE, embedding-based metrics like BERTScore, and RAG-specific metrics like MRR and NDCG.
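To give a sense of what an overlap-based text metric measures, here is a minimal sketch of ROUGE-1 F1 (unigram overlap). Lowercasing and whitespace tokenization are simplifying assumptions; published ROUGE implementations add stemming and other preprocessing, so scores will not match them exactly.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate text."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Metrics like this reward shared wording only, which is why the listing pairs them with embedding-based metrics such as BERTScore that can credit paraphrases.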

LLM Performance Evaluation

Name: LLM Performance Evaluation
Author: sickn33

Data Science & Machine Learning

Implements comprehensive evaluation strategies for LLM applications using automated metrics, human feedback loops, and benchmarking.

This skill provides a robust framework for measuring and improving the quality of AI applications by implementing standardized evaluation patterns. It covers a wide spectrum of testing methodologies, including automated linguistic metrics (BLEU, ROUGE, BERTScore), retrieval-specific measurements for RAG systems, and advanced 'LLM-as-judge' patterns. Whether you are comparing model versions, detecting performance regressions, or establishing production baselines, this skill equips Claude with the tools to validate prompt changes and ensure model reliability through statistical analysis and systematic benchmarking.

Key Features

01 RAG evaluation metrics for retrieval systems such as MRR and NDCG

02 LLM-as-Judge patterns for automated pointwise and pairwise comparisons

03 15,684 GitHub stars

04 Automated text generation metrics including BLEU, ROUGE, and BERTScore

05 A/B testing framework with statistical significance and effect size analysis

06 Regression detection to identify performance drops before deployment

Use Cases

01 Comparing different models or prompts to identify the most effective configuration

02 Measuring the groundedness and accuracy of RAG-based systems against ground truth

03 Detecting performance regressions in production AI systems during CI/CD

Install with 🐟 Skill.Fish

npx skillfish add sickn33/antigravity-awesome-skills llm-evaluation

For use in Claude.ai and ChatGPT
