About
The LLM Evaluation skill provides a framework for assessing the quality and performance of AI applications. It enables developers to measure model outputs systematically with automated metrics such as BLEU and BERTScore, apply "LLM-as-Judge" patterns for semantic validation, and run rigorous statistical A/B tests. By integrating regression detection and human-in-the-loop annotation workflows, the skill helps teams build confidence in their AI systems, catch performance drift early, and back up improvements from prompt engineering or model swaps with data-driven evidence.
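
To make two of these ideas concrete, here is a minimal sketch (not the skill's actual API) that scores outputs from two hypothetical model variants with sentence-level BLEU via NLTK and compares them with a paired bootstrap A/B test. The sample data, the `bleu_scores` helper, and the `paired_bootstrap` helper are all illustrative assumptions, not part of the skill.

```python
"""Sketch: BLEU scoring plus a paired bootstrap A/B comparison (illustrative only)."""

import random

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

# Hypothetical evaluation data: reference answers plus outputs from two variants.
references = [
    ["the cat sat on the mat"],
    ["open the pod bay doors"],
]
variant_a = ["the cat sat on a mat", "open the pod bay doors"]
variant_b = ["a cat is on the mat", "please open the doors"]

smooth = SmoothingFunction().method1


def bleu_scores(outputs, refs):
    """Sentence-level BLEU for each output against its reference set."""
    return [
        sentence_bleu([r.split() for r in ref], out.split(), smoothing_function=smooth)
        for out, ref in zip(outputs, refs)
    ]


def paired_bootstrap(a, b, iterations=10_000, seed=0):
    """Fraction of resampled test sets on which variant A outscores variant B."""
    rng = random.Random(seed)
    n = len(a)
    wins = 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(a[i] for i in idx) > sum(b[i] for i in idx):
            wins += 1
    return wins / iterations


scores_a = bleu_scores(variant_a, references)
scores_b = bleu_scores(variant_b, references)

print(f"mean BLEU  A={sum(scores_a) / len(scores_a):.3f}  B={sum(scores_b) / len(scores_b):.3f}")
print(f"P(A > B) under paired bootstrap ≈ {paired_bootstrap(scores_a, scores_b):.2f}")
```

In practice the evaluation set would contain far more than two examples, and the same bootstrap comparison could be run over BERTScore or LLM-as-Judge ratings instead of BLEU.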