Does it support human evaluation workflows?

Yes, it provides a structured human evaluation framework including annotation guidelines, rating scales, and tools to calculate inter-rater agreement using Cohen's Kappa.
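As a rough illustration of the agreement calculation mentioned above, here is a minimal sketch of Cohen's Kappa for two raters who labeled the same items. The function name `cohens_kappa` and the label values are illustrative assumptions, not the skill's own API.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labeling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


kappa = cohens_kappa(["good", "good", "bad", "bad"], ["good", "bad", "bad", "bad"])
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over a plain percentage when rating scales are small.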

How does the LLM-as-Judge pattern work?

It utilizes a high-capability model to act as an automated grader for another model's output, providing structured JSON ratings for accuracy, helpfulness, and clarity based on a defined rubric.
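The pattern can be sketched as follows, assuming the judge model is reachable through some callable that takes a prompt string and returns text. The names `build_judge_prompt`, `parse_judge_reply`, `judge`, and `call_model`, as well as the 1 to 5 scale, are assumptions made for illustration; the skill's actual prompt and schema may differ.

```python
import json
from typing import Callable, Dict

# Criteria named in the answer above; the 1-5 scale is an assumption.
CRITERIA = ("accuracy", "helpfulness", "clarity")


def build_judge_prompt(question: str, answer: str, rubric: str) -> str:
    """Assemble a grading prompt that asks for JSON-only output."""
    keys = ", ".join(f'"{c}"' for c in CRITERIA)
    return (
        "You are grading another model's answer.\n"
        f"Rubric:\n{rubric}\n\n"
        f"Question:\n{question}\n\n"
        f"Answer:\n{answer}\n\n"
        f"Reply with a JSON object with integer keys {keys}, each from 1 to 5."
    )


def parse_judge_reply(reply: str) -> Dict[str, int]:
    """Validate the judge's JSON reply and return the scores."""
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores = {}
    for criterion in CRITERIA:
        value = data.get(criterion)
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {criterion}: {value!r}")
        scores[criterion] = value
    return scores


def judge(question: str, answer: str, rubric: str,
          call_model: Callable[[str], str]) -> Dict[str, int]:
    """Run one pointwise evaluation; call_model is a hypothetical model client."""
    return parse_judge_reply(call_model(build_judge_prompt(question, answer, rubric)))


# Stand-in for a real model call, used only to show the data flow.
def fake_model(prompt: str) -> str:
    return '{"accuracy": 4, "helpfulness": 5, "clarity": 3}'


scores = judge("What is 2 + 2?", "4", "Reward correct, concise answers.", fake_model)
```

Validating the reply before using it matters in practice, because judge models occasionally return malformed JSON or out-of-range scores.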

Can I use this for regression testing in CI/CD?

Yes, the skill includes a RegressionDetector module designed to compare new model outputs against a baseline and flag significant performance drops based on a configurable threshold.
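The idea behind threshold-based regression detection can be sketched as below. This is not the skill's RegressionDetector interface, which is not shown here; `detect_regressions`, its parameters, and the relative-drop rule are assumptions, and all metrics are assumed to be higher-is-better.

```python
from typing import Dict


def detect_regressions(baseline: Dict[str, float],
                       candidate: Dict[str, float],
                       threshold: float = 0.05) -> Dict[str, float]:
    """Return metrics whose relative drop versus the baseline exceeds threshold.

    Hypothetical helper: assumes higher values are better for every metric.
    """
    regressions = {}
    for metric, base_value in baseline.items():
        if metric not in candidate or base_value == 0:
            # Skip metrics that cannot be compared on a relative scale.
            continue
        drop = (base_value - candidate[metric]) / abs(base_value)
        if drop > threshold:
            regressions[metric] = drop
    return regressions


flagged = detect_regressions(
    baseline={"accuracy": 0.90, "rouge_l": 0.50},
    candidate={"accuracy": 0.80, "rouge_l": 0.49},
    threshold=0.05,
)
```

In a CI job, a non-empty result would typically fail the build, for example by exiting with a non-zero status.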

What metrics are supported for RAG evaluation?

The skill includes specific metrics for Retrieval-Augmented Generation such as Mean Reciprocal Rank (MRR), NDCG, Precision@K, and custom groundedness checks to ensure answers are based on context.
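For reference, the three ranking metrics named above can be computed with the standard definitions, as in the following sketch. The function names and the document IDs are illustrative assumptions; groundedness checks are omitted because they depend on the skill's own implementation.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(doc_id in relevant_ids for doc_id in ranked_ids[:k]) / k


def ndcg_at_k(ranked_ids: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """NDCG@k with graded relevance; gains maps doc id to its relevance grade."""
    dcg = sum(gains.get(doc_id, 0.0) / math.log2(i + 1)
              for i, doc_id in enumerate(ranked_ids[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


mrr = mean_reciprocal_rank([(["d3", "d1"], {"d1"}), (["d2"], {"d2"})])
p_at_2 = precision_at_k(["d3", "d1", "d2"], {"d1", "d2"}, k=2)
ndcg = ndcg_at_k(["d1", "d2"], {"d1": 3.0, "d2": 1.0}, k=2)
```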

LLM Evaluation & Benchmarking

Author: ngxtm


Data Science & ML

Implements comprehensive evaluation frameworks for LLM applications using automated metrics, LLM-as-Judge patterns, and human feedback loops.

This skill provides a toolkit for measuring and improving the performance of Large Language Model applications within the Claude Code environment. It helps developers implement systematic evaluation strategies, including classic NLP metrics such as BLEU and ROUGE, embedding-based scores such as BERTScore, and 'LLM-as-Judge' patterns for qualitative assessment. Whether you are optimizing RAG pipelines, comparing model variants through statistical A/B testing, or building regression suites to catch performance drops before deployment, the skill offers implementation patterns and metrics for evaluating production LLM applications.

Key Features

01. 2 GitHub stars

02. LLM-as-Judge patterns for pointwise scoring and pairwise model comparisons

03. Statistical A/B testing framework with Cohen’s d effect size calculations

04. Automated NLP metrics including BLEU, ROUGE, METEOR, and BERTScore

05. RAG-specific evaluation for retrieval relevance, NDCG, and groundedness

06. Regression detection systems to monitor performance stability over time
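To make the effect size mentioned in the feature list concrete, here is a minimal sketch of Cohen's d with a pooled standard deviation, applied to per-example scores from two variants. The function name `cohens_d` and the sample scores are assumptions for illustration and are not taken from the skill.

```python
import math
import statistics
from typing import Sequence


def cohens_d(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Effect size of variant B over variant A using the pooled standard deviation."""
    n_a, n_b = len(scores_a), len(scores_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each variant needs at least two scores")
    var_a = statistics.variance(scores_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(scores_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    if pooled_sd == 0:
        raise ValueError("scores have zero variance; effect size is undefined")
    return (statistics.mean(scores_b) - statistics.mean(scores_a)) / pooled_sd


d = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

By the usual convention, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects; an effect size complements a significance test rather than replacing it.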

Use Cases

01. Measuring the quality of RAG systems by tracking retrieval precision and factual grounding

02. Validating the impact of prompt engineering iterations on response accuracy and safety

03. Comparing the performance of different model providers or versions for specific domain tasks


Install with 🐟 Skill.Fish

npx skillfish add ngxtm/devkit llm-evaluation
