Does it help with statistical significance in A/B testing?

Absolutely. It includes a statistical testing framework that calculates p-values and Cohen's d effect sizes to determine if improvements are statistically significant.
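As a rough illustration of the two quantities mentioned above, the sketch below computes Cohen's d with a pooled standard deviation and estimates a two-sided p-value. The listing does not say which significance test the skill uses, so a permutation test on the difference in means is swapped in here as an assumption; the function names `cohens_d` and `permutation_p_value` are hypothetical and not taken from the skill.

```python
import random
import statistics


def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    if pooled_sd == 0:
        return 0.0  # no spread in either sample; effect size is undefined, report 0
    return (statistics.mean(b) - statistics.mean(a)) / pooled_sd


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in means (assumed test)."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(b) - statistics.mean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(right) - statistics.mean(left)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-run quality scores for a control prompt and a variant.
control = [0.60, 0.62, 0.58, 0.61, 0.59]
variant = [0.70, 0.72, 0.69, 0.71, 0.68]
effect = cohens_d(control, variant)
p_value = permutation_p_value(control, variant, n_resamples=2000)
```

A permutation test makes no normality assumption, which suits small batches of evaluation scores; with larger samples a t-test would be a reasonable alternative.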

What metrics does the LLM Evaluation skill support?

It supports a wide range of metrics including automated scores (BLEU, ROUGE, BERTScore), RAG metrics (Precision@K, MRR), and classification metrics like F1-score and AUC-ROC.
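For readers unfamiliar with the retrieval metrics named above, here is a minimal sketch of Precision@K and MRR using their standard definitions. The function names and argument shapes are assumptions for illustration, not the skill's own API.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k


def mean_reciprocal_rank(rankings, relevant_sets):
    """Average of 1/rank of the first relevant item per query (0 if none is found)."""
    if not rankings:
        raise ValueError("at least one query is required")
    total = 0.0
    for retrieved, relevant in zip(rankings, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Hypothetical document IDs for two queries.
p_at_2 = precision_at_k(["d1", "d2", "d3", "d4"], {"d1", "d3"}, k=2)
mrr = mean_reciprocal_rank([["a", "b"], ["c", "d"]], [{"b"}, {"x"}])
```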

How do I prevent my model's performance from degrading over time?

The skill features a RegressionDetector class that compares new evaluation results against a baseline and flags any metrics that drop below a defined threshold.
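The listing names a RegressionDetector class but does not show its interface, so the following is only a guess at how such a check might look. The constructor arguments, the `detect` method, and the choice to measure the threshold as an absolute drop from the baseline are all assumptions, and the sketch treats every metric as higher-is-better.

```python
from dataclasses import dataclass


@dataclass
class RegressionDetector:
    """Hypothetical sketch; not the skill's actual class definition."""

    baseline: dict          # metric name -> baseline score
    max_drop: float = 0.02  # largest tolerated absolute drop (assumed semantics)

    def detect(self, current):
        """Return {metric: drop} for metrics that fell by more than max_drop."""
        regressions = {}
        for name, base_score in self.baseline.items():
            if name not in current:
                continue  # metric was not measured in this run
            drop = base_score - current[name]
            if drop > self.max_drop:
                regressions[name] = drop
        return regressions


# Hypothetical scores from a baseline run and a new run.
detector = RegressionDetector(baseline={"rouge_l": 0.80, "f1": 0.90}, max_drop=0.02)
flagged = detector.detect({"rouge_l": 0.75, "f1": 0.89})
```

In a CI job, a non-empty result from `detect` could be used to fail the build before a prompt or model change ships.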

How does 'LLM-as-Judge' work in this skill?

It provides standardized patterns to use a more capable model to evaluate the output of another model, supporting both single-output scoring and pairwise comparisons.
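To make the two modes concrete, here is a minimal sketch of single-output scoring and pairwise comparison. The judge is passed in as a plain callable that maps a prompt to a reply, so no particular model API is assumed; the prompt wording, the `Score:` and `Winner:` reply formats, and the function names are all hypothetical. The pairwise function asks twice with the answers swapped, a common way to reduce position bias, though the listing does not say whether the skill does this.

```python
import re
from typing import Callable

Judge = Callable[[str], str]  # prompt in, raw judge reply out (assumed interface)


def score_single(judge: Judge, question: str, answer: str) -> int:
    """Ask the judge for a 1-5 score on one answer."""
    prompt = (
        "Rate the answer to the question on a 1-5 scale.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply in the form 'Score: <n>'."
    )
    match = re.search(r"Score:\s*([1-5])", judge(prompt))
    if match is None:
        raise ValueError("judge reply did not contain a score")
    return int(match.group(1))


def compare_pair(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return 'A', 'B', or 'tie', asking in both orders to offset position bias."""

    def ask(first: str, second: str) -> str:
        prompt = (
            "Which answer is better?\n"
            f"Question: {question}\nAnswer 1: {first}\nAnswer 2: {second}\n"
            "Reply in the form 'Winner: 1' or 'Winner: 2'."
        )
        match = re.search(r"Winner:\s*([12])", judge(prompt))
        if match is None:
            raise ValueError("judge reply did not contain a winner")
        return match.group(1)

    forward = ask(answer_a, answer_b)   # "1" means A was preferred
    backward = ask(answer_b, answer_a)  # "1" means B was preferred
    if forward == "1" and backward == "2":
        return "A"
    if forward == "2" and backward == "1":
        return "B"
    return "tie"  # the judge disagreed with itself across orderings
```

Any client for a real model can be wrapped to fit the `Judge` signature; a stub function works for unit tests.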

Can I use this for RAG (Retrieval-Augmented Generation) apps?

Yes, the skill includes specific implementations for measuring context relevance, groundedness, and factual accuracy within RAG workflows.
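The listing does not describe how groundedness is computed. As a stand-in, the sketch below uses a crude lexical proxy: the share of answer sentences whose words mostly appear in the retrieved context. This is an assumption chosen for simplicity, not the skill's method, and the function name and `min_overlap` parameter are hypothetical; production setups more often use an entailment model or an LLM judge for this check.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer, contexts, min_overlap=0.6):
    """Fraction of answer sentences with at least min_overlap of their words in the context."""
    context_tokens = set()
    for passage in contexts:
        context_tokens |= _tokens(passage)

    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0

    supported = 0
    for sentence in sentences:
        words = _tokens(sentence)
        if words and len(words & context_tokens) / len(words) >= min_overlap:
            supported += 1
    return supported / len(sentences)


# Hypothetical retrieved passage and generated answer.
contexts = ["The Eiffel Tower is in Paris and was completed in 1889."]
answer = "The Eiffel Tower is in Paris. It was designed by aliens from Mars."
score = groundedness(answer, contexts)
```

Word overlap cannot detect a sentence that reuses context vocabulary to state something false, so a score from this proxy is best read as a cheap first filter.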

LLM Evaluation Suite

Name: LLM Evaluation Suite
Author: GaitanS

Data Science & Machine Learning

Implements comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking frameworks.

The LLM Evaluation skill provides a framework for systematically measuring and improving the performance of AI-powered applications. It helps developers move from anecdotal testing to rigorous validation by providing implementation patterns for automated metrics like BLEU and BERTScore, 'LLM-as-Judge' scoring, and statistical A/B testing. Whether you are optimizing RAG pipelines, comparing different foundation models, or preventing regressions during prompt engineering, this skill helps you check that your AI outputs meet production-grade standards for accuracy, safety, and helpfulness.

Key Features

01 Statistical A/B testing framework with p-value and effect size analysis

02 Automated NLP metrics including BLEU, ROUGE, and BERTScore

03 RAG-specific evaluation for retrieval precision and groundedness

04 Automated regression detection against established performance baselines

05 LLM-as-Judge patterns for pointwise and pairwise quality assessment

Use Cases

01 Validating RAG system performance using retrieval metrics and faithfulness checks

02 Establishing a CI/CD evaluation pipeline to catch performance regressions before deployment

03 Benchmarking different models or prompts to identify the highest quality configuration

Install with 🐟 Skill.Fish

npx skillfish add gaitans/ai-mancare llm-evaluation
