Does it support human-in-the-loop evaluation?

Yes. It provides structured frameworks for human annotation tasks and includes tools to calculate inter-rater agreement using Cohen’s kappa.
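
As an illustration of the agreement statistic mentioned above, here is a minimal sketch of Cohen's kappa for two annotators. The function name `cohens_kappa` and its list-based inputs are assumptions made for this example and are not taken from the skill's own API.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


labels_a = ["good", "good", "bad", "bad"]
labels_b = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(labels_a, labels_b)
```

Kappa corrects raw agreement for the agreement expected by chance, so it is usually a better signal of annotation quality than a simple match rate.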

What metrics are supported for RAG applications?

The skill includes specialized retrieval metrics like MRR, NDCG, and Precision@K, as well as NLI-based groundedness checks to ensure answers are supported by the provided context.
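
The retrieval metrics named above can be sketched in a few lines. This is an illustrative implementation of the standard definitions, not the skill's code; the function names, the document-ID inputs, and the graded `gains` dictionary are all assumptions.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k


def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k with graded relevance; `gains` maps doc_id -> relevance grade."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
    )
    ideal_gains = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(position + 2) for position, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0
```

MRR rewards placing the first relevant document early, Precision@K measures how clean the top of the list is, and NDCG additionally accounts for graded relevance and position.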

Can I use this for automated regression testing?

Yes, the skill includes a RegressionDetector that compares new results against a baseline and flags significant performance drops based on a configurable sensitivity threshold.
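
The listing does not show the RegressionDetector interface, so the following is only a sketch of the general idea. It assumes that metrics are stored as name-to-score dictionaries, that higher scores are better, and that the threshold is a relative drop. The name `detect_regressions` is hypothetical.

```python
def detect_regressions(baseline, current, threshold=0.05):
    """Return {metric: relative_change} for metrics that dropped by more than `threshold`.

    Assumptions for this sketch: higher is better, and `threshold` is a
    relative drop (0.05 means a 5% decrease from the baseline value).
    """
    regressions = {}
    for metric, base_value in baseline.items():
        # Skip metrics missing from the new run or with a zero baseline,
        # where a relative change is undefined.
        if metric not in current or base_value == 0:
            continue
        relative_change = (current[metric] - base_value) / abs(base_value)
        if relative_change < -threshold:
            regressions[metric] = relative_change
    return regressions


baseline_scores = {"accuracy": 0.80, "rouge_l": 0.50}
new_scores = {"accuracy": 0.70, "rouge_l": 0.49}
flagged = detect_regressions(baseline_scores, new_scores, threshold=0.05)
```

In a CI pipeline, a non-empty result from a check like this could be used to fail the build. A lower threshold makes the check more sensitive but also more prone to flagging noise.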

How does the LLM-as-judge pattern work?

It uses a highly capable model to evaluate the outputs of other models based on defined criteria like accuracy, helpfulness, and clarity, supporting both single-output scoring and pairwise comparisons.
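
A minimal sketch of both judging modes follows. It is not the skill's implementation: the prompt wording, the 1 to 10 scale, the reply format, and the function names are assumptions. `call_judge` is a hypothetical callable that sends a prompt to whichever judge model you use and returns its text reply.

```python
import re

# Criteria named in the text; the scale and prompt wording are assumptions.
CRITERIA = ("accuracy", "helpfulness", "clarity")


def judge_pointwise(question, answer, call_judge, criteria=CRITERIA):
    """Score one answer on each criterion; returns {criterion: int}."""
    prompt = (
        "You are an impartial evaluator.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Rate the answer from 1 to 10 on each criterion below, "
        "one per line, formatted as 'criterion: score'.\n"
        + "\n".join(f"- {name}" for name in criteria)
    )
    reply = call_judge(prompt)
    scores = {}
    for name in criteria:
        # Accept 'criterion: 8' in any letter case; skip criteria the judge omitted.
        match = re.search(rf"{re.escape(name)}\s*:\s*(\d+)", reply, re.IGNORECASE)
        if match:
            scores[name] = int(match.group(1))
    return scores


def judge_pairwise(question, answer_a, answer_b, call_judge):
    """Ask which answer is better; returns 'A', 'B', 'TIE', or None if unparseable."""
    prompt = (
        "You are an impartial evaluator.\n"
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Reply with exactly one word: A, B, or TIE."
    )
    verdict = call_judge(prompt).strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else None
```

Passing the judge in as a callable keeps the sketch independent of any particular model provider and makes it easy to test with a stub. For pairwise comparisons, it is common to run each pair twice with the answer order swapped, since judge models can favor one position.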

LLM Evaluation Framework

Author: nguyendinhquocx

Data Science & ML

Implements comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and LLM-as-judge patterns.

The LLM Evaluation skill provides a framework for measuring and improving the quality of AI-powered applications. It lets developers systematically assess performance through traditional NLP metrics like BLEU and ROUGE, embedding-based scores like BERTScore, and LLM-as-judge patterns. By supporting automated regression testing, human annotation workflows, and statistical A/B testing, the skill helps development teams build confidence in production AI systems, validate prompt engineering improvements, and detect performance drift before it affects end users.

Key Features

01 Automated metrics integration including BLEU, ROUGE, BERTScore, and Perplexity

02 RAG-specific evaluation metrics like MRR, NDCG, and Groundedness

03 LLM-as-judge implementation for pointwise and pairwise model comparisons

04 Automated regression detection to identify performance degradation between versions

05 Statistical A/B testing framework with Cohen’s d effect size calculation
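
To make the effect-size feature above concrete, here is a minimal sketch of Cohen's d for two sets of evaluation scores, using the pooled sample standard deviation. The function name `cohens_d` and the list inputs are assumptions for illustration, not the skill's API.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("Each group needs at least two scores")

    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_variance == 0:
        raise ValueError("Effect size is undefined when both groups have zero variance")

    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_variance)


variant_scores = [2, 4, 6]
control_scores = [1, 3, 5]
effect = cohens_d(variant_scores, control_scores)
```

A significance test says whether a difference between variants is likely real, while the effect size says how large it is. By a common rule of thumb, values around 0.2, 0.5, and 0.8 are read as small, medium, and large effects.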

Use Cases

01 Validating Retrieval-Augmented Generation (RAG) accuracy and factual groundedness

02 Establishing a continuous evaluation pipeline within CI/CD for AI application stability

03 Benchmarking different model versions or prompt templates to find the optimal configuration

Install with 🐟 Skill.Fish

npx skillfish add nguyendinhquocx/code-ai llm-evaluation

The skill is also available as a download for use in Claude.ai and ChatGPT.