Which automated metrics does this skill support?

It supports BLEU, ROUGE, and METEOR for text generation, BERTScore for semantic similarity, and retrieval ranking metrics such as MRR and NDCG for RAG systems.
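As an illustration of the two retrieval metrics named above, here is a minimal standard-library sketch of MRR and NDCG@k. It is not the skill's own implementation; the function names and the input shapes (lists of document ids, a dict of graded relevance) are assumptions made for this example.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def dcg(gains):
    """Discounted cumulative gain with the usual log2(position + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_ids, gains_by_id, k):
    """DCG of the top-k results divided by the DCG of an ideal ordering."""
    gains = [gains_by_id.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(gains_by_id.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Both metrics reward placing relevant documents early. MRR only looks at the first relevant hit, while NDCG accounts for graded relevance across the whole top-k list.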

Does it support statistical analysis for model comparisons?

Yes, it includes an A/B testing framework that runs t-tests and reports p-values and Cohen's d, so you can judge both whether an improvement is statistically significant and how large it is.
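The statistics mentioned above can be sketched with the Python standard library alone. This is an illustrative example, not the skill's code: the function names are hypothetical, and the p-value uses a normal approximation to the t distribution, which is only reasonable for larger samples. For small samples a proper t distribution (for example `scipy.stats.ttest_ind` with `equal_var=False`) would be the safer choice.

```python
import math
from statistics import NormalDist, mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples of per-example scores."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        raise ValueError("both samples have zero variance")
    return (mean(a) - mean(b)) / se

def cohens_d(a, b):
    """Effect size: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero")
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

def approx_two_sided_p(t):
    """Two-sided p-value under a normal approximation (assumption: large n)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))
```

Reporting Cohen's d alongside the p-value matters because a large evaluation set can make a negligible difference statistically significant.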

How does this skill help with RAG evaluation?

It provides metrics for retrieval accuracy, groundedness, and context relevance, which help verify that a RAG system's answers are supported by the retrieved context.
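The listing does not say how groundedness is computed. As a rough illustration of the idea only, the sketch below uses a crude lexical proxy: the fraction of answer sentences whose words mostly appear in the retrieved context. The function names and the 0.6 threshold are assumptions for this example, and a real system would more likely use an entailment model or an LLM judge.

```python
import re

def tokens(text):
    """Lowercased alphanumeric word set; a deliberately simple tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def groundedness(answer, contexts, threshold=0.6):
    """Share of answer sentences with at least `threshold` token overlap
    against the union of all retrieved context passages."""
    context_tokens = set().union(*(tokens(c) for c in contexts))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        sentence_tokens = tokens(sentence)
        if not sentence_tokens:
            continue
        overlap = len(sentence_tokens & context_tokens) / len(sentence_tokens)
        if overlap >= threshold:
            supported += 1
    return supported / len(sentences)
```

A lexical proxy like this misses paraphrases and can be fooled by word overlap without factual support, so it is best treated as a cheap first-pass signal.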

Can I use one LLM to evaluate another?

Yes, the skill includes 'LLM-as-judge' patterns for both single-output scoring (pointwise) and pairwise comparisons to determine which model response is superior.
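The two judging patterns can be sketched as follows. This is a hedged illustration rather than the skill's implementation: `judge` is a hypothetical callable that sends a prompt to whatever model you use and returns its text reply, and the prompt wording is invented for this example. The pairwise version asks twice with the responses swapped, since LLM judges are known to favor a position, and only declares a winner when both orderings agree.

```python
import re
from typing import Callable, Optional

Judge = Callable[[str], str]  # hypothetical: prompt in, model reply out

def pointwise_score(judge: Judge, question: str, answer: str,
                    scale: int = 5) -> Optional[int]:
    """Ask the judge for a 1..scale rating; None if the reply is unusable."""
    prompt = (
        f"Rate the answer to the question on a 1-{scale} scale.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with the number only."
    )
    match = re.search(r"\d+", judge(prompt))
    if not match:
        return None
    score = int(match.group())
    return score if 1 <= score <= scale else None

def pairwise_winner(judge: Judge, question: str, a: str, b: str) -> str:
    """Return "A", "B", or "tie", querying both orderings of the pair."""
    def ask(first: str, second: str) -> Optional[str]:
        prompt = (
            f"Question: {question}\n"
            f"Response 1: {first}\nResponse 2: {second}\n"
            "Which response is better? Reply with 1 or 2 only."
        )
        reply = judge(prompt).strip()
        return reply if reply in ("1", "2") else None

    forward = ask(a, b)    # a shown first
    backward = ask(b, a)   # b shown first
    if forward == "1" and backward == "2":
        return "A"
    if forward == "2" and backward == "1":
        return "B"
    return "tie"  # disagreement or unparseable replies
```

Treating disagreement between the two orderings as a tie is one possible design choice; it trades some decisiveness for robustness to position bias.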

LLM Evaluation Framework

Author: Berkay2002

Data Science & ML

Implements comprehensive evaluation strategies for Large Language Models using automated metrics, human feedback, and comparative benchmarking.

The llm-evaluation skill provides a toolkit for systematically assessing the performance, quality, and reliability of AI applications. It lets developers compute standard automated metrics such as BLEU and BERTScore, set up LLM-as-judge patterns for semantic quality, and run A/B tests with statistical significance checks. Teams can use it to validate prompt engineering changes, detect regressions early in the development lifecycle, and build confidence in complex agentic or RAG-based systems.

Key Features

01 Statistical A/B testing framework with Cohen's d effect size analysis

02 0 GitHub stars

03 Automated regression detection to identify performance drops before deployment

04 Implementation of automated metrics including BLEU, ROUGE, and BERTScore

05 LLM-as-judge patterns for pointwise and pairwise model comparison

06 Retrieval-specific metrics for RAG systems like MRR, NDCG, and groundedness
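To make the regression detection feature in the list above concrete, here is a minimal sketch of one way such a check could work: compare each metric against a stored baseline and flag drops beyond a tolerance. The function name, the dict-based inputs, and the 0.02 tolerance are assumptions for this example, and it assumes every metric is higher-is-better.

```python
def detect_regressions(baseline, current, tolerance=0.02):
    """Return {metric: drop} for metrics that fell by more than `tolerance`.

    baseline, current: dicts mapping metric name to score (higher is better).
    Metrics missing from `current` are skipped rather than flagged.
    """
    regressions = {}
    for name, base_score in baseline.items():
        if name not in current:
            continue
        drop = base_score - current[name]
        if drop > tolerance:
            regressions[name] = drop
    return regressions
```

A check like this could run in CI and fail the build when the returned dict is non-empty. A fixed tolerance ignores sampling noise, so pairing it with a significance test is worth considering.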

Use Cases

01 Detecting performance regressions in production AI pipelines

02 Measuring and validating RAG system performance for specialized domain knowledge

03 Comparing model responses across different prompts or fine-tuning iterations

Install with 🐟 Skill.Fish

npx skillfish add berkay2002/agentic-rag-test-scope-analysis llm-evaluation