What is the LLM-as-Judge approach?

LLM-as-Judge uses a highly capable model (such as Claude 3.5 Sonnet) to score the outputs of other models or prompts against explicit rubric criteria such as accuracy, helpfulness, and tone.
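A minimal sketch of a pointwise judge, assuming a rubric of three criteria scored 1 to 5. The prompt wording, the `judge` helper, and the `call_model` parameter are hypothetical and are not taken from the skill itself; `fake_judge_model` stands in for a real model call so the example runs offline.

```python
import re
from typing import Callable

# Rubric criteria named in the description above; the 1-5 scale is an assumption.
RUBRIC = ("accuracy", "helpfulness", "tone")


def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble a pointwise grading prompt (hypothetical wording)."""
    criteria = ", ".join(RUBRIC)
    return (
        f"Rate the answer on {criteria}, each from 1 to 5.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with one line per criterion, e.g. 'accuracy: 4'."
    )


def parse_scores(reply: str) -> dict[str, int]:
    """Extract 'criterion: score' pairs and ignore anything else in the reply."""
    scores = {}
    for name in RUBRIC:
        match = re.search(rf"{name}\s*:\s*([1-5])\b", reply, re.IGNORECASE)
        if match:
            scores[name] = int(match.group(1))
    return scores


def judge(question: str, answer: str, call_model: Callable[[str], str]) -> dict[str, int]:
    """call_model is any function that sends a prompt to the judge model."""
    return parse_scores(call_model(build_judge_prompt(question, answer)))


def fake_judge_model(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same grades."""
    return "accuracy: 4\nhelpfulness: 5\ntone: 3"


scores = judge("What is 2 + 2?", "4", fake_judge_model)
```

Passing the model call in as a function keeps the prompt and parsing logic testable without network access; a pairwise variant would show two answers in the prompt and parse a preference instead of scores.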

Can I use this for RAG applications?

Yes. The skill includes RAG-specific metrics: Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) for retrieval ranking quality, plus groundedness checks that verify the answer stays faithful to the retrieved context.
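For reference, the two retrieval metrics can be computed in a few lines. This is a sketch of the standard definitions, not the skill's own code: the function names are illustrative, and NDCG here normalizes against the ideal ordering of the retrieved list only (an assumption; some setups normalize against all known relevant documents).

```python
import math


def mean_reciprocal_rank(ranked_relevance: list[list[int]]) -> float:
    """Each inner list marks retrieved documents as relevant (1) or not (0), in rank order."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


def dcg(gains: list[float]) -> float:
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains: list[float]) -> float:
    """DCG divided by the DCG of the same gains in ideal (descending) order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Three queries: first relevant hit at rank 2, at rank 1, and never.
queries = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
mrr = mean_reciprocal_rank(queries)  # (1/2 + 1 + 0) / 3 = 0.5

# Graded relevance (0-3) of six retrieved documents, in rank order.
score = ndcg([3, 2, 3, 0, 1, 2])
```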

What metrics does this skill support for text generation?

It supports industry-standard metrics, including BLEU for n-gram overlap, ROUGE for summarization quality, and BERTScore for embedding-based semantic similarity.
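To make the overlap idea concrete, here is a deliberately simplified unigram-overlap score. It is not BLEU or ROUGE as the skill computes them: real BLEU combines several n-gram orders with a brevity penalty, and ROUGE implementations usually add stemming and longer-sequence variants. The function name and whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> dict[str, float]:
    """Clipped unigram precision (BLEU-style), recall (ROUGE-1-style), and their F1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection clips each word to the smaller of its two counts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# 5 of the 6 words match in each direction, so all three values are 5/6.
result = unigram_overlap("the cat sat on the mat", "the cat is on the mat")
```

Surface overlap of this kind cannot credit paraphrases, which is the gap embedding-based metrics such as BERTScore are meant to cover.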

Does it help with statistical significance?

Yes. The skill includes an A/B testing framework that runs t-tests and reports p-values and Cohen's d, so you can tell whether a change is a statistically significant improvement over the baseline and how large the effect is.
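The statistics involved can be sketched with the standard library alone. The source does not say which t-test variant the skill uses; Welch's unequal-variance test is assumed here, and the function names and sample scores are made up for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(a: list[float], b: list[float]) -> float:
    """Effect size: difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / math.sqrt(pooled_var)


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(b) - mean(a)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-run quality scores for a baseline prompt and a variant.
baseline = [0.70, 0.72, 0.68, 0.71, 0.69]
variant = [0.75, 0.77, 0.74, 0.76, 0.78]

t_stat, dof = welch_t(baseline, variant)
effect = cohens_d(baseline, variant)
```

The p-value comes from the t distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind(variant, baseline, equal_var=False)` returns the same statistic together with a two-sided p-value.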

LLM Evaluation Suite

Author: Activer007


Data Science & ML

Implements comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking.

This skill provides a robust framework for assessing the quality and performance of Large Language Model (LLM) applications through a multi-layered approach. It encompasses automated text generation metrics like BLEU and BERTScore, classification performance tracking, and specialized RAG evaluation tools for retrieval accuracy. Beyond basic metrics, it facilitates advanced 'LLM-as-Judge' patterns, human annotation workflows, and statistical A/B testing to ensure model improvements are both significant and reliable. It is an essential tool for developers looking to build production-grade AI systems with measurable quality standards.

Key Features

01 Statistical A/B testing framework with p-value and effect size calculations

02 Automated metrics including BLEU, ROUGE, and BERTScore for text similarity

03 LLM-as-Judge patterns for automated pointwise and pairwise evaluations

04 Comprehensive RAG metrics for measuring retrieval (MRR, NDCG) and groundedness

05 Regression detection to identify performance drops during the development cycle

06 0 GitHub stars

Use Cases

01 Comparing the performance and cost-effectiveness of different foundation models

02 Establishing automated quality gates within CI/CD pipelines for AI applications

03 Validating prompt engineering changes to ensure they improve rather than degrade output
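A quality gate of the kind described above can be as small as a comparison of current metrics against a stored baseline. This is a sketch under stated assumptions: the metric names, the values, the 0.02 tolerance, and the `find_regressions` helper are all hypothetical, and every metric is assumed to be higher-is-better.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return metrics that dropped more than `tolerance` below baseline, with the drop size."""
    drops = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue  # metrics missing from the current run are skipped, not failed
        drop = base_value - current[name]
        if drop > tolerance:
            drops[name] = round(drop, 4)
    return drops


# Hypothetical scores from the last accepted run and from the change under test.
baseline_scores = {"rouge_l": 0.42, "groundedness": 0.90, "mrr": 0.75}
current_scores = {"rouge_l": 0.43, "groundedness": 0.85, "mrr": 0.74}

regressions = find_regressions(baseline_scores, current_scores)
gate_passes = not regressions
```

In a CI job, the script would typically print `regressions` and exit with a non-zero status when `gate_passes` is false, so the pipeline blocks the change.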


Install with 🐟 Skill.Fish

npx skillfish add activer007/ordinary-claude-skills llm-evaluation

For use in Claude.ai and ChatGPT
