How does the 'LLM-as-Judge' pattern work?

It uses a stronger model to grade another model's outputs against an explicit rubric, either by scoring a single response (pointwise) or by choosing the better of two responses (pairwise).
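As a rough illustration of the two modes, here is a minimal sketch. The prompt wording, the 1-5 scale, and the function names are assumptions made for this example and are not taken from the skill itself. The judge is any callable that takes a prompt string and returns the model's text reply, so no particular API client is implied.

```python
from typing import Callable

# Hypothetical prompt templates; the skill's own rubrics are not shown on this page.
POINTWISE_PROMPT = (
    "Rate the response from 1 to 5 for {criterion}.\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Reply with a single integer."
)
PAIRWISE_PROMPT = (
    "Which response better satisfies {criterion}?\n"
    "Question: {question}\n"
    "Response A: {a}\n"
    "Response B: {b}\n"
    "Reply with exactly 'A' or 'B'."
)


def pointwise_score(judge: Callable[[str], str], question: str,
                    response: str, criterion: str = "accuracy") -> int:
    """Ask the judge for a 1-5 score and validate the reply."""
    reply = judge(POINTWISE_PROMPT.format(
        criterion=criterion, question=question, response=response))
    score = int(reply.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score


def pairwise_winner(judge: Callable[[str], str], question: str,
                    a: str, b: str, criterion: str = "accuracy") -> str:
    """Compare two responses, querying twice with the order swapped.

    Returns 'A' or 'B' only when both orderings agree; otherwise 'tie'.
    The swap is a common guard against position bias in the judge.
    """
    first = judge(PAIRWISE_PROMPT.format(
        criterion=criterion, question=question, a=a, b=b)).strip().upper()
    second = judge(PAIRWISE_PROMPT.format(
        criterion=criterion, question=question, a=b, b=a)).strip().upper()
    # Map the swapped verdict back onto the original labels.
    second_mapped = {"A": "B", "B": "A"}.get(second)
    if first in ("A", "B") and first == second_mapped:
        return first
    return "tie"
```

Passing the judge in as a plain callable keeps the sketch independent of any provider SDK and makes it easy to substitute a stub in tests.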

Is this skill useful for RAG (Retrieval-Augmented Generation)?

Yes. It includes metrics for measuring retrieval quality and for checking whether model responses are grounded in the retrieved context.

Does it require external libraries?

Yes. The implementation patterns rely on widely used third-party Python packages such as NLTK, rouge-score, bert-score, and scipy (the last for statistical analysis).

What automated metrics does this skill support?

It supports n-gram overlap metrics (BLEU, ROUGE), semantic similarity (BERTScore), and retrieval metrics for RAG systems (MRR, NDCG, Precision@K).
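For readers unfamiliar with the retrieval metrics, the following sketch shows the textbook definitions in plain Python. The function names and signatures are chosen for this example and are not the skill's own API.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """DCG of the top-k ranking divided by the DCG of the ideal ranking.

    `gains` maps document id to graded relevance; missing ids count as 0.
    """
    dcg = sum(gains.get(doc, 0.0) / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Precision@K and MRR treat relevance as binary, while NDCG accepts graded relevance and discounts gains by rank position.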

Can I use this to detect performance regressions?

Yes, the skill includes a RegressionDetector class designed to compare new model outputs against established baselines to flag significant performance drops.
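The interface of RegressionDetector is not documented on this page, so the sketch below is not that class. It is a hypothetical stand-alone function illustrating the general idea: compare each metric against its baseline and flag any that fall by more than a tolerance. The 5% default threshold and the higher-is-better assumption are choices made for this example.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float],
                     candidate: Dict[str, float],
                     max_relative_drop: float = 0.05) -> List[str]:
    """Return names of metrics that dropped more than the allowed fraction.

    Assumes every metric is "higher is better". Metrics missing from
    `candidate`, or with a non-positive baseline, are skipped.
    """
    flagged = []
    for name, base in baseline.items():
        if name not in candidate or base <= 0:
            continue
        drop = (base - candidate[name]) / base
        if drop > max_relative_drop:
            flagged.append(name)
    return flagged


# Example: accuracy fell 12.5% relative to baseline, ROUGE-L only 2%.
baseline_metrics = {"rouge_l": 0.50, "accuracy": 0.80}
candidate_metrics = {"rouge_l": 0.49, "accuracy": 0.70}
regressed = find_regressions(baseline_metrics, candidate_metrics)
```

In a CI pipeline, a non-empty result would typically fail the build. A fixed threshold ignores sampling noise, so pairing it with a significance test on per-example scores is advisable for small evaluation sets.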

LLM Evaluation Framework

Name: LLM Evaluation Framework
Author: CarlosVerasteguii


Data Science & ML

Implements comprehensive evaluation strategies for Large Language Model applications using automated metrics, human feedback, and rigorous benchmarking.

The LLM Evaluation skill provides a robust framework for systematically assessing the performance of AI applications. It offers a standardized approach to measuring quality through multiple methodologies, including automated linguistic metrics (BLEU, ROUGE), semantic similarity (BERTScore), and specialized retrieval-augmented generation (RAG) metrics. By incorporating patterns for LLM-as-judge, human annotation workflows, and statistical A/B testing, this skill enables developers to move beyond vibes-based testing to data-driven model optimization and regression prevention.
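To make the statistical A/B testing idea concrete, here is a minimal sketch of a paired permutation (sign-flip) test using only the Python standard library. It is one reasonable way to compare two systems scored on the same examples, not necessarily the test the skill uses. The function name, resample count, and seed are assumptions for this example.

```python
import random
from typing import Sequence


def paired_permutation_test(scores_a: Sequence[float],
                            scores_b: Sequence[float],
                            n_resamples: int = 10_000,
                            seed: int = 0) -> float:
    """Two-sided p-value for the mean paired difference being zero.

    Each example must be scored under both variants. Under the null
    hypothesis the sign of each paired difference is arbitrary, so we
    flip signs at random and count how often the resampled mean is at
    least as extreme as the observed one.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)  # fixed seed keeps results reproducible
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / n) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)
```

A small p-value suggests the difference between variants is unlikely to be noise. The test makes no normality assumption, which suits bounded or skewed evaluation scores.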

Key Features

01 Automated NLP metrics including BLEU, ROUGE, and BERTScore

02 RAG-specific metrics for retrieval and groundedness assessment

03 LLM-as-Judge patterns for scalable, automated quality grading

04 Statistical A/B testing and regression detection frameworks

05 Human evaluation workflows with inter-rater agreement tracking

06 0 GitHub stars
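Inter-rater agreement for human evaluation is commonly summarized with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below implements the standard two-rater formula. Whether the skill uses kappa or another statistic is not stated on this page, and the function name is an assumption for this example.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.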

Use Cases

01 Establishing automated quality gates in CI/CD pipelines for AI applications

02 Comparing performance between different model versions or prompt strategies

03 Debugging and validating RAG system accuracy and factual groundedness


Install with 🐟 Skill.Fish

npx skillfish add carlosverasteguii/daticket llm-evaluation

For use in Claude.ai and ChatGPT
