About
This skill provides a comprehensive toolkit for benchmarking and validating AI application quality, helping developers move beyond subjective spot-checking. It supports rigorous automated metrics such as BLEU and BERTScore, retrieval-specific evaluations for RAG systems, and LLM-as-judge frameworks for qualitative assessment. By incorporating statistical A/B testing and regression detection, it ensures that prompt changes and model updates yield measurable improvements without destabilizing production.
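As a minimal sketch of the statistical A/B testing idea mentioned above, the snippet below compares the pass rates of two prompt variants with a two-proportion z-test. The function name and the sample counts are hypothetical, not part of this skill's API; it assumes each eval case yields a binary pass/fail outcome.

```python
import math

def two_proportion_z(passes_a: int, n_a: int, passes_b: int, n_b: int):
    """Two-proportion z-test: is variant B's pass rate significantly
    different from variant A's? Returns (z statistic, two-sided p-value)."""
    p_a, p_b = passes_a / n_a, passes_b / n_b
    # Pooled proportion under the null hypothesis (no difference)
    p_pool = (passes_a + passes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical eval results: prompt A passes 72/100 cases, prompt B 85/100
z, p = two_proportion_z(72, 100, 85, 100)
print(f"z={z:.2f}, p={p:.4f}")  # p < 0.05 → B's improvement is significant
```

In practice a library such as `scipy.stats` offers equivalent tests with corrections; the point here is that a pass-rate delta between prompt variants should clear a significance threshold before being treated as a real improvement rather than noise.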