What metrics are supported for RAG evaluation?

The skill supports standard retrieval metrics, including Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and Precision/Recall at K, to measure context relevance and retrieval accuracy.
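As a rough illustration of what these retrieval metrics compute, here is a minimal standard-library sketch. The function names and input shapes (a ranked list of document IDs plus a set of relevant IDs or a dict of graded gains) are assumptions made for this example and are not taken from the skill itself.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / len(relevant_ids)


def ndcg_at_k(ranked_ids, gains, k):
    """gains: dict of doc_id -> graded relevance (missing IDs count as 0)."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

MRR rewards placing the first relevant document early, while NDCG also accounts for graded relevance and the position of every retrieved item, so the two can disagree on the same ranking.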

Does it support statistical significance testing?

Yes, it includes a statistical testing framework that uses t-tests to check whether a change in model performance is statistically significant, and Cohen's d to gauge whether the size of that change is meaningful or negligible.
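To make the two quantities concrete, the sketch below computes Welch's t statistic and Cohen's d for two lists of per-example scores using only the standard library. The function names are hypothetical, and the choice of Welch's unequal-variance variant is an assumption, since the page does not say which t-test the skill uses.

```python
import math
from statistics import mean, variance


def cohens_d(baseline, candidate):
    """Cohen's d (candidate minus baseline) using the pooled sample std dev."""
    na, nb = len(baseline), len(candidate)
    pooled_var = ((na - 1) * variance(baseline)
                  + (nb - 1) * variance(candidate)) / (na + nb - 2)
    return (mean(candidate) - mean(baseline)) / math.sqrt(pooled_var)


def welch_t(baseline, candidate):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va = variance(baseline) / len(baseline)
    vb = variance(candidate) / len(candidate)
    t = (mean(candidate) - mean(baseline)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(baseline) - 1)
                           + vb ** 2 / (len(candidate) - 1))
    return t, df
```

Turning the t statistic into a p-value requires the t distribution, which the standard library does not provide; if SciPy is available, `scipy.stats.ttest_ind(baseline, candidate, equal_var=False)` returns both the statistic and the p-value. A common rule of thumb treats |d| near 0.2 as small, 0.5 as medium, and 0.8 as large.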

Can I use this for human evaluation workflows?

Yes, it includes structured frameworks for human annotation tasks, quality dimensions (coherence, relevance, safety), and tools for calculating inter-rater agreement.
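One common way to quantify inter-rater agreement between two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an illustrative standard-library version; the function name and label format are assumptions, and the page does not state which agreement statistic the skill actually provides.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items where the two raters chose the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa of 1.0 means perfect agreement, 0.0 means agreement no better than chance, and negative values mean the raters agree less often than chance would predict.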

How does the LLM-as-judge feature work?

It provides implementation patterns for using a highly capable model to grade the outputs of other models, either by scoring each output on its own (pointwise) or by comparing outputs in pairs (pairwise), with a gold standard as the point of reference.
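The two patterns can be sketched as follows. Everything here is hypothetical: `judge` stands in for any function that sends a prompt to a model and returns its text reply, and the prompt wording, the 1 to 5 scale, and the order-swapping step are assumptions chosen for illustration, not the skill's actual implementation.

```python
import re
from typing import Callable

# Hypothetical judge interface: prompt text in, raw model reply out.
Judge = Callable[[str], str]


def pointwise_score(judge: Judge, question: str, answer: str, reference: str) -> int:
    """Ask the judge for a 1-5 score of one answer against a gold reference."""
    prompt = (
        "Rate the answer from 1 (poor) to 5 (excellent) against the reference.\n"
        f"Question: {question}\nReference: {reference}\nAnswer: {answer}\n"
        "Reply with a single digit."
    )
    match = re.search(r"\b([1-5])\b", judge(prompt))
    if match is None:
        raise ValueError("judge reply did not contain a score from 1 to 5")
    return int(match.group(1))


def pairwise_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return 'A', 'B', or 'tie'; asks twice with the order swapped."""
    def ask(first: str, second: str) -> str:
        prompt = (
            "Which answer is better? Reply with 1 or 2.\n"
            f"Question: {question}\nAnswer 1: {first}\nAnswer 2: {second}"
        )
        return judge(prompt).strip()[:1]

    forward = ask(answer_a, answer_b)   # "1" means A was preferred
    backward = ask(answer_b, answer_a)  # "1" means B was preferred
    if forward == "1" and backward == "2":
        return "A"
    if forward == "2" and backward == "1":
        return "B"
    return "tie"  # inconsistent or unparseable verdicts count as a tie
```

Asking the pairwise question in both orders is one simple guard against position bias: a judge that always favors the first answer produces contradictory verdicts, which this sketch reports as a tie.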

LLM Evaluation Suite

Name: LLM Evaluation Suite
Author: camoneart

Data Science & ML

Implements systematic evaluation strategies for Large Language Model applications using automated metrics, LLM-as-judge patterns, and statistical testing.

This skill provides a comprehensive toolkit for benchmarking and validating AI application quality, enabling developers to move beyond subjective testing. It facilitates the implementation of rigorous automated metrics like BLEU and BERTScore, retrieval-specific evaluations for RAG systems, and sophisticated LLM-as-judge frameworks for qualitative assessment. By incorporating statistical A/B testing and regression detection, it ensures that prompt engineering and model updates lead to measurable improvements while maintaining production stability.

Key Features

01 RAG-specific evaluation for retrieval (MRR, NDCG) and groundedness

02 Statistical A/B testing framework with Cohen's d effect size analysis

03 LLM-as-judge patterns for pointwise and pairwise model comparisons

04 3 GitHub stars

05 Automated text metrics including BLEU, ROUGE, and BERTScore

06 Automated regression detection to prevent performance drops during updates
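Regression detection of the kind listed above can be as simple as comparing a candidate run's metrics with a stored baseline. The sketch below is one hypothetical approach: the function name, the dict-of-metrics format, the default tolerance, and the assumption that higher values are better are all choices made for this example.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return {metric: (baseline_value, candidate_value)} for regressed metrics.

    Assumes higher is better for every metric. A metric counts as regressed
    if it dropped by more than `tolerance` or is missing from the candidate.
    """
    regressions = {}
    for name, base_value in baseline.items():
        new_value = candidate.get(name)
        if new_value is None or base_value - new_value > tolerance:
            regressions[name] = (base_value, new_value)
    return regressions
```

In a CI/CD job, a non-empty result could be used to fail the build, for example by exiting with a non-zero status after printing the regressed metrics.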

Use Cases

01 Validating RAG system retrieval quality and factual groundedness

02 Comparing performance across different LLM models or prompt iterations

03 Establishing production baselines and automated CI/CD regression testing

Install with 🐟 Skill.Fish

npx skillfish add camoneart/claude-code llm-evaluation

For use in Claude.ai and ChatGPT