Can this skill help evaluate RAG applications?

Yes, it includes specialized patterns for RAG evaluation, focusing on groundedness, context relevance, and retrieval coverage to ensure your knowledge base is utilized correctly.

Which automated metrics are supported by this skill?

The skill supports text generation metrics (BLEU, ROUGE, BERTScore), classification metrics (Accuracy, F1, AUC-ROC), and retrieval metrics (MRR, NDCG, Precision@K).
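
As an illustration of the retrieval metrics named above, here is a minimal sketch in plain Python. The function names and the toy document IDs are assumptions made for this example and are not the skill's own API; a real pipeline would more likely rely on an established metrics library.

```python
import math

def precision_at_k(relevant, ranked, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def reciprocal_rank(relevant, ranked):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """MRR over an iterable of (relevant_set, ranked_list) pairs."""
    scores = [reciprocal_rank(rel, ranked) for rel, ranked in queries]
    return sum(scores) / len(scores)

def ndcg_at_k(gains, ranked, k):
    """NDCG@k, where `gains` maps document ID -> graded relevance."""
    dcg = sum(gains.get(doc, 0.0) / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical query: two relevant documents, four retrieved.
relevant = {"d1", "d2"}
ranked = ["d3", "d1", "d7", "d2"]
p_at_2 = precision_at_k(relevant, ranked, 2)
rr = reciprocal_rank(relevant, ranked)
ndcg = ndcg_at_k({"d1": 1.0, "d2": 1.0}, ranked, 4)
```

Precision@K and MRR use binary relevance here, while NDCG accepts graded relevance, so it also rewards placing the most useful documents first.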

What is the 'LLM-as-judge' approach?

LLM-as-judge uses a highly capable model to evaluate the output of another model, providing automated qualitative scoring that is intended to approximate human judgment.
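
A pointwise version of this approach might look like the sketch below. The prompt wording, the 1-5 scale, the `judge_pointwise` helper, and the `call_judge` callable are all assumptions for illustration; the stub `fake_judge` stands in for a real model call so the example runs without any API.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template; the skill's actual prompts may differ.
POINTWISE_TEMPLATE = (
    "Rate the response to the question on a 1-5 scale for accuracy "
    "and helpfulness.\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Reply with 'Score: <1-5>' followed by a one-sentence reason."
)

def judge_pointwise(question: str, response: str,
                    call_judge: Callable[[str], str]) -> Optional[int]:
    """Ask the judge model for a score; return None if the reply is unparseable."""
    prompt = POINTWISE_TEMPLATE.format(question=question, response=response)
    reply = call_judge(prompt)
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None

def fake_judge(prompt: str) -> str:
    """Stub judge used only for this example; replace with a real model call."""
    return "Score: 4. Mostly correct but omits a caveat."

score = judge_pointwise("What is BLEU?", "An n-gram overlap metric.", fake_judge)
```

Returning `None` for unparseable replies keeps malformed judge output from silently counting as a score.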

How does it handle regression testing in AI apps?

It provides a RegressionDetector framework that compares current evaluation scores against established baselines, flagging significant drops in performance automatically.
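
The idea of comparing current scores against a baseline can be sketched as follows. This is not the skill's RegressionDetector; `detect_regressions`, its `max_relative_drop` parameter, and the sample scores are hypothetical, and the sketch assumes every metric is "higher is better".

```python
def detect_regressions(baseline, current, max_relative_drop=0.05):
    """Return {metric: relative_drop} for metrics that fell more than the threshold.

    Assumes higher scores are better and skips metrics that are missing
    from `current` or have a non-positive baseline.
    """
    regressions = {}
    for metric, base in baseline.items():
        if metric not in current or base <= 0:
            continue
        drop = (base - current[metric]) / base
        if drop > max_relative_drop:
            regressions[metric] = drop
    return regressions

# Hypothetical scores from a stored baseline run and the current run.
baseline = {"accuracy": 0.90, "rouge_l": 0.50}
current = {"accuracy": 0.80, "rouge_l": 0.49}
flagged = detect_regressions(baseline, current)
```

In a CI job, a non-empty result would typically fail the build. A fixed relative threshold is the simplest choice; a statistical test across repeated runs is more robust to noise.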

LLM Evaluation Suite

Author: carlopezzuto

Data Science & ML

Implements comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and rigorous benchmarking.

The LLM Evaluation skill provides a robust framework for measuring and improving the performance of AI-driven applications. It enables developers to move beyond anecdotal testing by implementing systematic evaluation pipelines using traditional NLP metrics, embedding-based similarities, and advanced 'LLM-as-judge' patterns. By facilitating statistical A/B testing and automated regression detection, this skill helps engineering teams validate prompt changes, compare model variants, and establish production-grade baselines with data-driven confidence.

Key Features

01 LLM-as-Judge patterns for pointwise and pairwise qualitative assessment

02 Automated regression detection to prevent performance drift in CI/CD

03 Automated NLP metrics including BLEU, ROUGE, and BERTScore

04 Statistical A/B testing framework with Cohen's d effect size analysis

05 RAG-specific evaluation for groundedness, relevance, and retrieval quality
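
The Cohen's d effect size mentioned in the feature list can be computed with the standard pooled-standard-deviation formula. The sketch below uses only the Python standard library; the `cohens_d` helper and the per-run scores for the two prompt variants are made up for illustration.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled sample variance.

    Positive values mean `b` scored higher than `a` on average.
    Raises ZeroDivisionError if both samples have zero variance.
    """
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(b) - mean(a)) / math.sqrt(pooled_var)

# Hypothetical evaluation scores for prompt variant A and variant B.
variant_a = [0.70, 0.72, 0.68, 0.71, 0.69]
variant_b = [0.75, 0.77, 0.73, 0.76, 0.74]
d = cohens_d(variant_a, variant_b)
```

By the common rule of thumb, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects. Effect size complements a significance test and does not replace it.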

Use Cases

01 Measuring RAG system performance using retrieval and generation metrics

02 Establishing systematic human-in-the-loop evaluation workflows

03 Validating prompt engineering improvements and comparing model providers

Install with 🐟 Skill.Fish

npx skillfish add carlopezzuto/agents llm-evaluation
