About
This skill provides a framework for measuring and improving the quality of AI applications. It covers core evaluation methodologies, including traditional NLP metrics such as BLEU and ROUGE, embedding-based similarity scores, and modern LLM-as-judge techniques. By combining automated regression detection, statistical A/B testing, and human annotation workflows, it helps developers build confidence in production systems, validate prompt improvements, and systematically compare model performance across use cases such as RAG systems and text generation.
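
As a small illustration of the kind of lexical-overlap metric mentioned above, the sketch below computes a ROUGE-1-style recall between a reference answer and a model output. It is a minimal, self-contained example for orientation only; the function name and inputs are illustrative and not part of this skill's interface.

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """ROUGE-1-style recall: fraction of reference unigrams that appear in the candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each reference token is credited at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand_counts[token]) for token, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

# Example: score a generated answer against a gold reference.
reference = "The cat sat on the mat"
candidate = "A cat sat on a mat"
print(f"ROUGE-1 recall: {rouge1_recall(reference, candidate):.2f}")  # 0.67
```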