How do I measure the performance of a RAG system?

RAG systems are typically evaluated with retrieval metrics such as MRR (Mean Reciprocal Rank) and NDCG (Normalized Discounted Cumulative Gain) to test the search component, and with groundedness metrics to check that the generation component relies only on the provided context.
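As a rough illustration of the two retrieval metrics named above, the following minimal sketch computes MRR over a batch of queries and NDCG@k for a single query. The function names, document ids, and relevance grades are hypothetical and not taken from the skill itself; the NDCG variant shown uses linear gains with a log2 discount, which is one common convention among several.

```python
import math

def mean_reciprocal_rank(ranked_results, relevant_sets):
    """Average of 1/rank of the first relevant document per query (0 if none is found)."""
    total = 0.0
    for ranked, relevant in zip(ranked_results, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

def ndcg_at_k(ranked, gains, k):
    """NDCG@k with graded relevance; `gains` maps doc id -> relevance grade."""
    dcg = sum(gains.get(doc_id, 0) / math.log2(i + 2)
              for i, doc_id in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical retrieval results for two queries.
ranked_results = [["d3", "d1", "d2"], ["d5", "d4"]]
relevant_sets = [{"d1"}, {"d5"}]
mrr = mean_reciprocal_rank(ranked_results, relevant_sets)  # (1/2 + 1/1) / 2 = 0.75

gains = {"d1": 3, "d2": 1}
ndcg = ndcg_at_k(["d3", "d1", "d2"], gains, k=3)
```

MRR only looks at the first relevant hit per query, while NDCG also rewards placing more relevant documents higher, so the two are usually reported together.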

Why should I use BERTScore instead of BLEU?

While BLEU scores exact n-gram overlap with a reference, BERTScore compares contextual embeddings to measure semantic similarity, so it can reward responses that mean the same thing even when they use different vocabulary.
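To see why surface overlap penalizes paraphrases, here is a small sketch of clipped n-gram precision, the building block of BLEU. It is not full BLEU (there is no brevity penalty and no geometric mean over n-gram orders), and the sentences are made-up examples. BERTScore is not shown because it needs a pretrained embedding model.

```python
from collections import Counter

def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference,
    with each n-gram's count clipped to its count in the reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total

reference = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"  # same meaning, different words

verbatim_score = clipped_ngram_precision(reference, reference)     # every unigram matches
paraphrase_score = clipped_ngram_precision(paraphrase, reference)  # only "the" matches
```

The paraphrase shares a single token with the reference, so an overlap-based score treats it as nearly wrong even though a human reader would accept it. That gap is what embedding-based metrics aim to close.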

What is a 'Golden Set' in LLM evaluation?

A Golden Set is a curated list of input prompts and their corresponding 'perfect' or reference outputs used as a benchmark to consistently measure and compare model performance over time.
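A Golden Set can be as simple as a list of prompt/reference pairs plus a scoring rule. The sketch below assumes normalized exact match as the scoring rule and a hypothetical `stub_model` function standing in for a real model call; real suites often swap in a softer metric or a judge model.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not count as failures."""
    return " ".join(text.lower().split())

def pass_rate(golden_set, generate):
    """Share of golden cases where the model output matches the reference after normalization."""
    passed = sum(
        1 for case in golden_set
        if normalize(generate(case["prompt"])) == normalize(case["reference"])
    )
    return passed / len(golden_set)

def has_regressed(current, baseline, tolerance=0.02):
    """Flag a regression when the current pass rate drops more than `tolerance` below the baseline."""
    return current < baseline - tolerance

# Hypothetical golden set and a stub model used only for illustration.
GOLDEN_SET = [
    {"prompt": "Capital of France?", "reference": "Paris"},
    {"prompt": "2 + 2 =", "reference": "4"},
]

def stub_model(prompt):
    return {"Capital of France?": " paris ", "2 + 2 =": "5"}[prompt]

score = pass_rate(GOLDEN_SET, stub_model)
regressed = has_regressed(score, baseline=1.0)
```

Keeping the set fixed between runs is what makes scores comparable over time; the tolerance value here is an arbitrary placeholder.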

What is the LLM-as-judge evaluation pattern?

LLM-as-judge is an evaluation method where a high-capability model, such as Claude 3.5 Sonnet, acts as a grader to assess the outputs of other models or prompts based on specific rubrics like helpfulness, safety, or accuracy.
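A minimal sketch of the pattern follows. It builds one grading prompt per rubric criterion and parses a numeric score from the reply. The `call_model` argument is a hypothetical callable (prompt in, text out) that you would wire to whichever model client you use; the prompt wording and the 1 to 5 scale are assumptions, not part of the skill.

```python
import re

RUBRIC = ("helpfulness", "safety", "accuracy")

def build_judge_prompt(question, answer, criterion):
    """Format a grading prompt that asks for one integer score on a single criterion."""
    return (
        f"You are grading an answer for {criterion}.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single line of the form 'Score: N' where N is an integer from 1 to 5."
    )

def parse_score(reply):
    """Extract the 1-5 score from the judge reply, or None if the reply is malformed."""
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None

def judge(question, answer, call_model):
    """Score one answer on every rubric criterion using the supplied judge callable."""
    return {
        criterion: parse_score(call_model(build_judge_prompt(question, answer, criterion)))
        for criterion in RUBRIC
    }

# Stub judge for illustration only; a real judge would call a model API here.
def fake_judge(prompt):
    return "Score: 4"

scores = judge("What is RAG?", "Retrieval-augmented generation.", fake_judge)
```

Returning `None` for unparseable replies, rather than guessing, makes judge failures visible in the aggregated results.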

LLM Evaluation Framework

Name: LLM Evaluation Framework
Author: sla-te

Data Science & Machine Learning

Implement comprehensive evaluation strategies for AI applications using automated metrics, LLM-as-judge patterns, and human feedback.

The LLM Evaluation skill provides a structured methodology for measuring the quality, accuracy, and reliability of Large Language Model applications. It empowers developers to move beyond vibes-based testing by implementing rigorous evaluation suites featuring traditional NLP metrics (BLEU, ROUGE), semantic embedding scores (BERTScore), and advanced LLM-as-judge patterns. By integrating these strategies, teams can systematically compare model versions, validate RAG pipeline retrieval, and detect performance regressions before they reach production users.

Key Features

01 Automated metrics implementation, including BLEU, ROUGE, and BERTScore

02 RAG-specific retrieval metrics such as MRR, NDCG, and Precision@K

03 Human evaluation frameworks with inter-rater agreement (Cohen's Kappa)

04 LLM-as-judge patterns for semantic quality and pairwise comparison

05 Statistical A/B testing and groundedness verification patterns
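For the inter-rater agreement item in the list above, Cohen's Kappa compares the agreement two raters actually reach with the agreement expected by chance. The following is a small sketch under the assumption of two raters labeling the same items with categorical labels; the labels and data are invented for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters: (observed - expected) / (1 - expected)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

rater_a = ["good", "good", "bad", "bad"]
rater_b = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(rater_a, rater_b)  # observed 0.75, expected 0.5 -> 0.5
```

A value near 1 indicates strong agreement, a value near 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.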

Use Cases

01 Establishing production baselines to detect performance regressions during deployment

02 Measuring the factual accuracy and groundedness of RAG-based systems

03 Comparing model performance across different prompt versions or system instructions

Install with 🐟 Skill.Fish

npx skillfish add sla-te/copilot-converter llm-evaluation

For use in Claude.ai and ChatGPT