What automated metrics are supported for text generation?

The skill supports automated metrics including BLEU and ROUGE for n-gram overlap, METEOR for word matching that also credits stems and synonyms, and BERTScore for embedding-based semantic similarity.
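To make the idea of an overlap metric concrete, here is a minimal sketch of a ROUGE-1-style unigram F1 score. It is not the skill's own implementation: the function name `unigram_f1` and the whitespace tokenization are assumptions made for illustration, and real ROUGE implementations typically add stemming and other normalization.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1: harmonic mean of unigram precision and recall.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(unigram_f1("the cat", "the cat sat on the mat"))
```

A candidate that copies only part of the reference scores high on precision but low on recall, which is why the two are combined.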

How does the regression detection work?

It compares new evaluation results against a saved baseline. If scores for key metrics drop beyond a defined threshold, it flags the change as a regression to prevent low-quality updates from reaching production.
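The baseline comparison described above can be sketched roughly as follows. This is an illustrative example under stated assumptions, not the skill's actual code: `find_regressions`, the metric names, and the `max_drop` threshold are all hypothetical, and every metric is assumed to be "higher is better".

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.02,
) -> List[str]:
    """Return the metrics whose score fell by more than max_drop.

    Hypothetical helper; assumes higher scores are better for every metric.
    """
    regressions = []
    for name, base_score in baseline.items():
        if name not in current:
            continue  # metric missing from the new run, nothing to compare
        if base_score - current[name] > max_drop:
            regressions.append(name)
    return regressions


if __name__ == "__main__":
    baseline = {"rouge_l": 0.41, "bertscore_f1": 0.88}
    current = {"rouge_l": 0.35, "bertscore_f1": 0.875}
    print(find_regressions(baseline, current))  # prints ['rouge_l']
```

In a deployment pipeline, a non-empty result would typically fail the check and block the release.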

Can this skill help evaluate RAG systems?

Yes, it includes specific metrics for Retrieval-Augmented Generation such as Mean Reciprocal Rank (MRR), NDCG, and custom groundedness checks to ensure answers are based on provided context.
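For readers unfamiliar with the two retrieval metrics named above, here is a small sketch of how MRR and NDCG are commonly computed. The function names and input shapes are assumptions for illustration and do not reflect the skill's API. The NDCG variant shown uses linear gains; some implementations use `2**rel - 1` instead.

```python
import math
from typing import List


def mean_reciprocal_rank(relevance_lists: List[List[int]]) -> float:
    """Average of 1/rank of the first relevant result, one list per query.

    Each inner list holds 0/1 relevance flags in ranked order.
    """
    if not relevance_lists:
        return 0.0
    total = 0.0
    for flags in relevance_lists:
        for rank, flag in enumerate(flags, start=1):
            if flag:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(relevance_lists)


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(gains: List[float]) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # prints 0.75
    print(ndcg([3, 2, 1]))  # prints 1.0
```

MRR only rewards the position of the first relevant document, while NDCG accounts for graded relevance across the whole ranking.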

What is 'LLM-as-judge' and how is it used?

LLM-as-judge is a pattern where a highly capable model (like Claude 3.5 Sonnet) is used to score or compare the outputs of other models based on specific rubrics like helpfulness, safety, or accuracy.
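A pointwise version of this pattern might look like the sketch below. Everything here is hypothetical: the prompt wording, the 1-5 scale, and the `call_judge` callable (a stand-in for whatever client sends a prompt to the judge model and returns its text reply) are assumptions, not part of the skill or any vendor API.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric prompt; the wording and 1-5 scale are assumptions.
RUBRIC_PROMPT = (
    "Rate the response for {criterion} on a 1-5 scale.\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Reply with a line of the form 'Score: <n>'."
)


def judge_pointwise(
    question: str,
    response: str,
    criterion: str,
    call_judge: Callable[[str], str],
) -> Optional[int]:
    """Ask a judge model for a 1-5 score; return None if the reply is unparseable."""
    prompt = RUBRIC_PROMPT.format(
        criterion=criterion, question=question, response=response
    )
    reply = call_judge(prompt)
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return "The answer is clear and mostly correct.\nScore: 4"


if __name__ == "__main__":
    print(judge_pointwise("What is 2+2?", "4", "accuracy", fake_judge))  # prints 4
```

Injecting the judge as a callable keeps the scoring logic testable without network access; a pairwise variant would present two responses and ask which is better.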

LLM Evaluation Framework

Name: LLM Evaluation Framework
Author: Activer007


Data Science and ML

Implements rigorous evaluation strategies for AI applications using automated metrics, human feedback, and LLM-as-judge patterns.

The LLM Evaluation skill provides a comprehensive toolkit for assessing the quality, accuracy, and reliability of Large Language Model outputs. It enables developers to move beyond vibes-based testing by establishing robust evaluation frameworks that combine automated NLP metrics like BLEU and BERTScore with sophisticated LLM-as-judge patterns. Whether you are validating RAG systems, comparing different model providers, or performing A/B tests on prompt iterations, this skill helps you detect regressions, measure groundedness, and build production-ready AI applications with statistical confidence.

Key Features

01 Statistical A/B testing framework with Cohen's d effect size

02 Specialized RAG metrics for retrieval quality and groundedness

03 LLM-as-judge patterns for pointwise and pairwise semantic evaluation

04 Automated regression detection to prevent performance degradation

05 Automated text metrics including BLEU, ROUGE, and BERTScore
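As a rough illustration of the Cohen's d effect size mentioned in the feature list, the sketch below compares per-example scores from two variants using the pooled standard deviation. The function name and sample data are hypothetical, and this is the textbook formula rather than the skill's own code.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Cohen's d for variant B relative to variant A, using pooled std dev.

    Hypothetical helper; each group needs at least two scores.
    """
    n_a, n_b = len(scores_a), len(scores_b)
    pooled_var = (
        (n_a - 1) * variance(scores_a) + (n_b - 1) * variance(scores_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when both groups have zero variance")
    return (mean(scores_b) - mean(scores_a)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    prompt_a = [1.0, 2.0, 3.0]  # made-up scores for illustration only
    prompt_b = [2.0, 3.0, 4.0]
    print(cohens_d(prompt_a, prompt_b))  # prints 1.0
```

By the usual rule of thumb, values near 0.2 are read as small effects, 0.5 as medium, and 0.8 or more as large, which helps separate a meaningful improvement from one that is merely statistically detectable.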

Use Cases

01 Comparing model performance across different versions or providers

02 Detecting performance regressions before deploying new prompt templates

03 Validating RAG system accuracy and document retrieval relevance


Install with 🐟 Skill.Fish

npx skillfish add activer007/ordinary-claude-skills llm-evaluation
