How does the LLM-as-Judge pattern work?

It leverages high-capability models to evaluate the outputs of other models based on defined criteria like helpfulness, accuracy, and clarity, providing a scalable alternative to human review.
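A minimal pointwise sketch of this pattern follows. It assumes a hypothetical `judge` callable that takes a prompt string and returns the judge model's raw text reply. The prompt wording, the 1 to 5 scale, and the names `build_judge_prompt`, `parse_scores`, `judge_answer`, and `fake_judge` are illustrative assumptions, not the skill's actual API.

```python
import json
from typing import Callable, Dict

# Criteria named in the answer above; the 1-5 scale is an assumption.
CRITERIA = ("helpfulness", "accuracy", "clarity")


def build_judge_prompt(question: str, answer: str) -> str:
    """Ask the judge for one score per criterion, returned as JSON."""
    keys = ", ".join(CRITERIA)
    return (
        "You are grading an AI answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Score each of these criteria from 1 to 5: {keys}.\n"
        "Reply with a JSON object mapping each criterion to its score."
    )


def parse_scores(reply: str) -> Dict[str, int]:
    """Validate the judge's reply; raise ValueError if it is malformed."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {reply!r}") from exc
    scores = {}
    for name in CRITERIA:
        value = data.get(name) if isinstance(data, dict) else None
        if not isinstance(value, int) or isinstance(value, bool) or not 1 <= value <= 5:
            raise ValueError(f"missing or invalid score for {name!r}")
        scores[name] = value
    return scores


def judge_answer(question: str, answer: str, judge: Callable[[str], str]) -> Dict[str, int]:
    """Run one pointwise evaluation through the supplied judge callable."""
    return parse_scores(judge(build_judge_prompt(question, answer)))


def fake_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real model call so the sketch runs offline."""
    return '{"helpfulness": 4, "accuracy": 5, "clarity": 3}'


scores = judge_answer("What is 2 + 2?", "4", fake_judge)
print(scores)
```

In practice `fake_judge` would be swapped for a function that calls a real model. Validating the reply before trusting it matters because judge models can return malformed or out-of-range output.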

What metrics are included for RAG evaluation?

The skill includes retrieval-specific metrics such as Mean Reciprocal Rank (MRR), NDCG, and Precision/Recall at K to evaluate how accurately and relevantly the retrieval stage of a RAG system ranks documents.
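These metrics can be sketched with the standard textbook definitions, assuming binary relevance (a document is either relevant or not) and ranked lists without duplicates. The function names and the sample document IDs are illustrative and are not taken from the skill itself.

```python
import math
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (k must be positive)."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked list, relevant set) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical retrieval results for two queries.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
p2 = precision_at_k(ranked, relevant, 2)
r2 = recall_at_k(ranked, relevant, 2)
mrr = mean_reciprocal_rank([(ranked, relevant), (["a", "b"], {"a"})])
ndcg = ndcg_at_k(ranked, relevant, 4)
print(p2, r2, mrr, round(ndcg, 3))
```

Graded relevance (for example, scores from 0 to 3) would require replacing the binary gain in `ndcg_at_k` with the per-document grade.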

Does it support human-in-the-loop evaluation?

Absolutely. It provides structured annotation guidelines and tools to calculate inter-rater agreement (Cohen's Kappa) to ensure manual quality assessments are consistent and reliable.
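For reference, Cohen's Kappa compares the observed agreement between two raters with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below uses that standard formula; the function name and the sample labels are illustrative assumptions rather than the skill's own tooling.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa for two raters labelling the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations from two reviewers on six model outputs.
annotator_1 = ["good", "good", "bad", "bad", "good", "bad"]
annotator_2 = ["good", "bad", "bad", "bad", "good", "good"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the raters agree on 4 of 6 items (p_o = 0.667) while chance agreement is 0.5, so kappa is about 0.333. Values near 0 indicate agreement no better than chance, and values near 1 indicate strong agreement.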

Can I use this for automated CI/CD testing?

Yes, it includes a Regression Detector module specifically designed to compare new model results against established baselines and flag significant performance drops automatically.
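The idea of a baseline comparison can be sketched as follows. This is not the skill's Regression Detector module; the function `detect_regressions`, the `max_relative_drop` parameter, and the metric names are hypothetical. It also uses a fixed relative threshold rather than a statistical significance test, and assumes every metric is higher-is-better.

```python
from typing import Dict, List


def detect_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_relative_drop: float = 0.05,
) -> List[str]:
    """Return a message for each baseline metric that is missing or dropped too far."""
    failures = []
    for name, base in baseline.items():
        new = candidate.get(name)
        if new is None:
            failures.append(f"{name}: missing from candidate run")
            continue
        # Relative drop against the baseline; fall back to the absolute drop if it is 0.
        drop = (base - new) / abs(base) if base else base - new
        if drop > max_relative_drop:
            failures.append(f"{name}: {base:.3f} -> {new:.3f}")
    return failures


# Hypothetical scores from a stored baseline and a new model run.
baseline = {"accuracy": 0.90, "ndcg": 0.80}
candidate = {"accuracy": 0.89, "ndcg": 0.70}
failures = detect_regressions(baseline, candidate)
print(failures)
```

In a CI/CD job, a non-empty `failures` list would typically be turned into a non-zero exit code so the pipeline step fails.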

LLM Evaluation Framework

Name: LLM Evaluation Framework
Author: carloscroc


Data Science and ML

Implements comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and comparative benchmarking.

The LLM Evaluation skill provides a framework for assessing the quality and performance of AI applications. It enables developers to systematically measure model outputs using automated metrics like BLEU and BERTScore, establish 'LLM-as-Judge' patterns for semantic validation, and conduct statistical A/B testing. By integrating regression detection and human-in-the-loop annotation workflows, this skill helps development teams build confidence in their AI systems, identify performance drift early, and validate improvements from prompt engineering or model swaps with data-driven evidence.

Key Features

01 0 GitHub stars

02 LLM-as-Judge patterns for pointwise and pairwise semantic evaluation

03 Statistical A/B testing framework with Cohen’s d effect size analysis

04 Retrieval evaluation for RAG systems using MRR, NDCG, and Precision@K

05 Automated regression detection to prevent performance degradation

06 Automated text generation metrics including BLEU, ROUGE, and BERTScore
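As an illustration of the effect size mentioned in the A/B testing feature, Cohen's d divides the difference between two group means by their pooled standard deviation. The sketch below uses that standard definition with Python's `statistics` module; the function name and the sample scores are made up for illustration and do not come from the skill.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d for two independent samples (each needs at least two values)."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when both groups have zero variance")
    return (mean(group_b) - mean(group_a)) / math.sqrt(pooled_var)


# Hypothetical per-run quality scores for prompt variant A and variant B.
variant_a = [0.70, 0.72, 0.68, 0.71, 0.69]
variant_b = [0.74, 0.76, 0.73, 0.75, 0.77]
d = cohens_d(variant_a, variant_b)
print(round(d, 2))
```

A common rule of thumb reads d near 0.2 as a small effect, 0.5 as medium, and 0.8 or more as large. Effect size complements a significance test, because a statistically significant difference can still be too small to matter in practice.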

Use Cases

01 Comparing performance and accuracy when migrating between different LLM providers

02 Validating the impact of prompt engineering changes on model groundedness

03 Establishing automated quality gates within CI/CD pipelines for AI applications


Install with 🐟 Skill.Fish

npx skillfish add carloscroc/fitness llm-evaluation

A downloadable version of the skill is available for use in Claude.ai and ChatGPT.