What metrics are included for RAG evaluation?

The skill includes specific retrieval metrics like Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and Precision/Recall at K to measure search and context quality.
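As a rough illustration of what these retrieval metrics compute, here is a minimal sketch in plain Python. The function names and signatures are assumptions made for this example and are not taken from the skill itself; NDCG is shown with linear gains and a log2 discount, which is one common formulation.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for d in ranked_ids[:k] if d in relevant_ids)
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for d in ranked_ids[:k] if d in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k with linear gains; `gains` maps doc id -> graded relevance."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

For a ranking `["d3", "d1", "d7"]` with relevant set `{"d1", "d9"}`, the first relevant hit is at rank 2, so the reciprocal rank is 0.5 and recall at 3 is also 0.5.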

Can I use this for human-in-the-loop testing?

Yes, the skill provides structured annotation guidelines and tools to calculate inter-rater agreement using Cohen's Kappa for manual assessment.
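To make the agreement statistic concrete, the following is a small sketch of Cohen's Kappa for two annotators labeling the same items. It is an independent illustration, not the skill's own tooling, and the function name is an assumption.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A value of 1.0 indicates perfect agreement, while 0.0 indicates agreement no better than chance.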

How does the LLM-as-Judge feature work?

It implements pointwise scoring and pairwise comparison patterns, in which a more capable model evaluates another model's output along custom dimensions such as accuracy and helpfulness.
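The two patterns can be sketched without committing to any particular provider. In the example below, `Judge` is a hypothetical callable that takes a prompt and returns the judge model's raw text; the prompt wording, function names, and the order-swapping step in the pairwise comparison (a common way to reduce position bias) are all assumptions for illustration, not the skill's actual implementation.

```python
import re
from typing import Callable

# Hypothetical interface: wrap whichever LLM client you use in a function like this.
Judge = Callable[[str], str]

def pointwise_score(judge: Judge, question: str, answer: str,
                    dimension: str, scale: int = 5) -> int:
    """Ask the judge to rate one answer on a 1..scale integer scale."""
    prompt = (
        f"Question: {question}\n\nAnswer: {answer}\n\n"
        f"Rate the answer on {dimension} from 1 to {scale}. Reply with only the number."
    )
    match = re.search(r"\d+", judge(prompt))
    if match is None:
        raise ValueError("judge reply contained no score")
    score = int(match.group())
    if not 1 <= score <= scale:
        raise ValueError(f"score {score} is outside 1..{scale}")
    return score

def pairwise_compare(judge: Judge, question: str, answer_a: str,
                     answer_b: str, dimension: str) -> str:
    """Return 'A', 'B', or 'tie', querying twice with the order swapped."""
    def ask(first: str, second: str) -> str:
        prompt = (
            f"Question: {question}\n\nResponse 1: {first}\n\nResponse 2: {second}\n\n"
            f"Which response is better on {dimension}? Reply with only the digit 1 or 2."
        )
        return judge(prompt).strip()[:1]

    forward = ask(answer_a, answer_b)   # "1" means A preferred
    backward = ask(answer_b, answer_a)  # "1" means B preferred
    if forward == "1" and backward == "2":
        return "A"
    if forward == "2" and backward == "1":
        return "B"
    return "tie"  # inconsistent verdicts are treated as a tie
```

A judge that always answers "1" regardless of content yields `"tie"` here, which is the point of asking in both orders.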

Does it support regression testing?

Yes, it includes a RegressionDetector class that compares current results against a baseline to flag significant performance decreases before deployment.
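The listing does not show the interface of `RegressionDetector`, so the sketch below is only a guess at the general idea: store baseline metrics and flag any metric that drops by more than a tolerated fraction. The field names, the `check` method, the 5% default threshold, and the assumption that higher values are better are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RegressionDetector:
    """Illustrative only; not the skill's actual class."""
    baseline: dict            # metric name -> baseline value (higher is better)
    max_relative_drop: float = 0.05  # assumed tolerance: 5% relative decrease

    def check(self, current: dict) -> dict:
        """Return {metric: relative_drop} for metrics that regressed past the tolerance."""
        regressions = {}
        for name, base in self.baseline.items():
            if name not in current or base <= 0:
                continue  # skip metrics that cannot be compared
            drop = (base - current[name]) / base
            if drop > self.max_relative_drop:
                regressions[name] = drop
        return regressions
```

In a deployment pipeline, a non-empty result from `check` could be used to fail the build. A fixed threshold is the simplest choice; a statistical test over per-example scores would be a stricter alternative.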

LLM Performance Evaluation

Name: LLM Performance Evaluation
Author: kivilaid

Data Science and ML

Implements systematic evaluation frameworks for LLM applications using automated metrics, human-in-the-loop feedback, and A/B testing.

This skill provides a comprehensive toolkit for developers and data scientists to objectively measure the quality, accuracy, and reliability of Large Language Model applications. It bridges the gap between raw model output and production-ready software by facilitating automated scoring with metrics like BERTScore, implementing LLM-as-Judge patterns for semantic validation, and providing statistical frameworks for A/B testing and regression detection. Whether you are fine-tuning prompts or comparing model architectures, this skill ensures your AI features meet high-quality standards through rigorous, data-driven assessment.

Key Features

01 RAG-specific retrieval metrics like NDCG and Mean Reciprocal Rank (MRR)

02 LLM-as-Judge patterns for semantic and qualitative scoring

03 Automated regression detection to prevent performance drops during updates

04 Automated NLP metrics including BLEU, ROUGE, and BERTScore

05 Statistical A/B testing framework with Cohen's d effect size analysis

06 6 GitHub stars
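For the A/B testing feature, Cohen's d expresses the difference between two variants' mean scores in units of their pooled standard deviation. The following is a minimal standard-library sketch; the function name and the sign convention (variant B minus variant A) are assumptions for this example.

```python
import math
from statistics import mean, variance

def cohens_d(scores_a, scores_b):
    """Cohen's d for two independent samples, using the pooled sample variance."""
    n_a, n_b = len(scores_a), len(scores_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each variant needs at least two scores")
    pooled_var = ((n_a - 1) * variance(scores_a) +
                  (n_b - 1) * variance(scores_b)) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when both samples have zero variance")
    # Positive d means variant B scored higher on average.
    return (mean(scores_b) - mean(scores_a)) / math.sqrt(pooled_var)
```

By a widely used rule of thumb, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects. An effect size complements a significance test rather than replacing it.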

Use Cases

01 Comparing model performance across different prompts or LLM providers

02 Detecting performance regressions in production AI workflows

03 Measuring the groundedness and factuality of RAG-based systems


Install with 🐟 Skill.Fish

npx skillfish add kivilaid/plugin-marketplace llm-evaluation
