Introduction
This skill provides a comprehensive toolkit for developers and data scientists to objectively measure the quality, accuracy, and reliability of Large Language Model applications. It bridges the gap between raw model output and production-ready software by automating scoring with metrics such as BERTScore, implementing LLM-as-Judge patterns for semantic validation, and providing statistical frameworks for A/B testing and regression detection. Whether you are fine-tuning prompts or comparing model architectures, this skill helps ensure your AI features meet high quality standards through rigorous, data-driven assessment.
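As a quick illustration of what automated scoring looks like in practice, the sketch below computes BERTScore for a candidate answer against a reference. It uses the open-source `bert-score` package rather than this skill's own API, so the import, package name, and example strings are assumptions for illustration only:

```python
# Minimal sketch of automated scoring with BERTScore.
# Assumes the open-source `bert-score` package is installed (pip install bert-score);
# this is a generic illustration, not this skill's own interface.
from bert_score import score

candidates = ["The model returned an accurate summary of the document."]
references = ["The model produced a correct summary of the document."]

# score() returns precision, recall, and F1 tensors, one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)

print(f"BERTScore F1: {F1.mean().item():.4f}")
```

Scores like this F1 value can then feed the statistical comparisons mentioned above, for example as the metric tracked in an A/B test or a regression check between prompt versions.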