Does it support statistical significance testing?

Yes, it includes a built-in A/B testing framework that calculates p-values and Cohen's d effect sizes, so you can judge whether an observed improvement is statistically significant and how large it is.
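As an illustration of what such an A/B comparison computes, here is a minimal sketch using only the Python standard library. It is not the skill's own code: the function names `cohens_d` and `permutation_p_value` are hypothetical, and the p-value comes from a two-sided permutation test, which is one common choice and may differ from the test the skill actually uses.

```python
import random
import statistics


def cohens_d(control, treatment):
    """Cohen's d with a pooled sample standard deviation.

    Assumes each group has at least two scores and that the scores
    are not all identical (otherwise the pooled variance is zero).
    """
    n_c, n_t = len(control), len(treatment)
    pooled_var = (
        (n_c - 1) * statistics.variance(control)
        + (n_t - 1) * statistics.variance(treatment)
    ) / (n_c + n_t - 2)
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_var ** 0.5


def permutation_p_value(control, treatment, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in means."""
    rng = random.Random(seed)
    n_c = len(control)
    pooled = list(control) + list(treatment)

    def mean_gap(values):
        left, right = values[:n_c], values[n_c:]
        return abs(sum(right) / len(right) - sum(left) / len(left))

    observed = mean_gap(pooled)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        # Small tolerance so float rounding does not hide exact ties.
        if mean_gap(pooled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-example quality scores for two prompt variants.
prompt_a = [0.70, 0.72, 0.68, 0.71, 0.69]
prompt_b = [0.80, 0.82, 0.78, 0.81, 0.79]
effect = cohens_d(prompt_a, prompt_b)
p_value = permutation_p_value(prompt_a, prompt_b)
```

A small p-value says the gap is unlikely under random relabeling of the scores; Cohen's d says how large the gap is relative to the spread of the scores. Reporting both guards against treating a tiny but "significant" difference as a meaningful win.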

What metrics are supported for RAG applications?

The skill includes RAG-specific metrics like Mean Reciprocal Rank (MRR), NDCG, and Precision@K, along with custom checks for groundedness and factuality.
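For readers unfamiliar with these retrieval metrics, the following sketch shows their standard textbook definitions in plain Python. The function names, the use of document ID strings, and the `gains` dictionary of graded relevance labels are assumptions made for illustration; the skill's own implementations may be structured differently.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is found."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


def ndcg_at_k(retrieved, gains, k):
    """NDCG@k with graded relevance; gains maps doc ID -> relevance grade."""
    dcg = sum(
        gains.get(doc, 0) / math.log2(i + 2)
        for i, doc in enumerate(retrieved[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query: the retriever returned d3, d1, d7; d1 and d2 are relevant.
retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d2"}
p_at_3 = precision_at_k(retrieved, relevant, k=3)   # one of three is relevant
rr = reciprocal_rank(retrieved, relevant)           # first hit is at rank 2
```

Precision@K ignores ordering within the top K, MRR rewards placing the first relevant document early, and NDCG accounts for both graded relevance and position, so the three are usually reported together.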

How does it detect performance regressions?

It features a dedicated Regression Detector class that compares current test results against a baseline and flags any metrics that drop below a configurable threshold.
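The baseline comparison described above can be sketched as follows. This is an illustrative assumption, not the skill's actual class: the name `RegressionDetector`, its `detect` method, and the interpretation of the threshold as a maximum tolerated relative drop are all hypothetical, and the sketch assumes every metric is higher-is-better.

```python
from dataclasses import dataclass


@dataclass
class RegressionDetector:
    """Hypothetical detector: flags metrics that fell too far below baseline."""

    baseline: dict            # metric name -> baseline score
    threshold: float = 0.05   # assumed meaning: max tolerated relative drop

    def detect(self, current):
        """Return {metric: relative_drop} for each metric that regressed."""
        regressions = {}
        for name, base in self.baseline.items():
            # Skip metrics missing from the current run or with a zero baseline,
            # where a relative drop is undefined.
            if name not in current or base == 0:
                continue
            drop = (base - current[name]) / abs(base)
            if drop > self.threshold:
                regressions[name] = drop
        return regressions


# Hypothetical scores from a stored baseline run and the current run.
detector = RegressionDetector(
    baseline={"accuracy": 0.90, "groundedness": 0.80},
    threshold=0.05,
)
flagged = detector.detect({"accuracy": 0.89, "groundedness": 0.70})
# Only groundedness is flagged: it dropped by about 12.5%, accuracy by about 1%.
```

A check like this typically runs in CI after the evaluation suite, failing the build when `flagged` is non-empty.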

Can I use this for model-to-model comparisons?

Yes, it provides pairwise comparison patterns using an LLM-as-judge approach to determine which model or prompt variation produces better results.
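A pairwise LLM-as-judge comparison might look like the sketch below. Everything here is an assumption for illustration: the prompt wording, the `pairwise_compare` function, and the `judge` callable (any function that sends a prompt to a model and returns its text reply) are hypothetical and not taken from the skill. The sketch asks the judge twice with the answer order swapped, a common way to reduce position bias, and declares a tie when the two verdicts disagree.

```python
from typing import Callable

# Hypothetical judging prompt; real prompts usually add a scoring rubric.
JUDGE_PROMPT = """You are comparing two responses to the same question.

Question:
{question}

Response A:
{first}

Response B:
{second}

Reply with exactly one letter: A or B."""


def pairwise_compare(question: str, answer_1: str, answer_2: str,
                     judge: Callable[[str], str]) -> str:
    """Return 'answer_1', 'answer_2', or 'tie'."""

    def ask(first: str, second: str) -> str:
        prompt = JUDGE_PROMPT.format(question=question, first=first, second=second)
        verdict = judge(prompt).strip().upper()
        return verdict if verdict in {"A", "B"} else "INVALID"

    # First pass: answer_1 is shown as Response A.
    forward = {"A": "answer_1", "B": "answer_2"}.get(ask(answer_1, answer_2))
    # Second pass: order swapped, so Response A is now answer_2.
    backward = {"A": "answer_2", "B": "answer_1"}.get(ask(answer_2, answer_1))

    # Accept a winner only if both orderings agree; otherwise call it a tie.
    if forward is not None and forward == backward:
        return forward
    return "tie"
```

In practice `judge` would wrap a call to whichever model API you use; passing it in as a callable keeps the comparison logic testable with a stub.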

LLM Evaluation Framework

Name: LLM Evaluation Framework
Author: pur3v4d3r


Data Science & ML

Implements systematic evaluation strategies for AI applications using automated metrics, human feedback loops, and LLM-as-judge patterns.

This skill provides a comprehensive toolkit for measuring and improving the performance of Large Language Model applications. It guides developers through the implementation of standard NLP metrics like BLEU and ROUGE, sophisticated embedding-based scores like BERTScore, and modern LLM-as-judge evaluation patterns. Whether you are building RAG pipelines, chatbots, or classification agents, this skill helps establish rigorous baselines, perform A/B testing with statistical significance, and detect performance regressions before they reach production. It bridges the gap between raw model output and production-grade reliability by quantifying quality across dimensions like accuracy, groundedness, and coherence.

Key Features

01 RAG-specific evaluation for retrieval quality and groundedness

02 Automated metrics implementation including BLEU, ROUGE, and BERTScore

03 Automated regression detection to prevent performance degradation

04 LLM-as-judge patterns for pointwise and pairwise model comparisons

05 Statistical A/B testing framework with p-value and Cohen's d analysis

GitHub stars: 1

Use Cases

01 Validating RAG system accuracy and retrieval relevance

02 Establishing quality benchmarks for production AI agents

03 Comparing performance between different model versions or prompt strategies


Install with 🐟 Skill.Fish

npx skillfish add pur3v4d3r/pur3-pkb-codebase llm-evaluation

For use in Claude.ai and ChatGPT, download the skill instead.