Does this skill help with production regressions?

Yes, it includes a RegressionDetector framework that compares new results against established baselines to flag significant performance drops before they reach users.
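The baseline comparison described above can be sketched in a few lines. This is only an illustration of the idea, not the skill's actual RegressionDetector API: the function name `detect_regressions`, the `RegressionReport` dataclass, and the 5% `max_relative_drop` threshold are all assumptions, and the sketch assumes every metric is higher-is-better.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RegressionReport:
    # Hypothetical report structure, not taken from the skill itself.
    metric: str
    baseline: float
    current: float
    relative_change: float
    regressed: bool


def detect_regressions(baseline, current, max_relative_drop=0.05):
    """Flag metrics whose mean score fell by more than max_relative_drop.

    baseline and current map a metric name to a list of per-example scores.
    Assumes higher scores are better for every metric.
    """
    reports = []
    for metric, base_scores in baseline.items():
        if metric not in current:
            continue  # nothing to compare against
        base_mean = mean(base_scores)
        cur_mean = mean(current[metric])
        # Relative change; guard against a zero baseline.
        change = (cur_mean - base_mean) / base_mean if base_mean else 0.0
        reports.append(
            RegressionReport(
                metric=metric,
                baseline=base_mean,
                current=cur_mean,
                relative_change=change,
                regressed=change < -max_relative_drop,
            )
        )
    return reports
```

A check like this would typically run in CI on a fixed evaluation set, failing the build when any report has `regressed` set.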

Can I use this for RAG (Retrieval-Augmented Generation) applications?

Yes, it includes specific strategies for measuring retrieval performance and coverage, and for checking groundedness so that the model's answers stay within the provided context.
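For the retrieval side, two commonly used ranking metrics are MRR and NDCG. The sketch below shows one way to compute them in plain Python. The function names and input shapes are assumptions for illustration, and the NDCG variant uses linear gains (some implementations use 2^rel - 1 instead).

```python
import math


def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant document per query."""
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


def ndcg_at_k(ranked, gains, k):
    """NDCG@k with linear gains; gains maps doc_id -> graded relevance."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(i + 1)
        for i, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

Groundedness is harder to reduce to a formula and is usually assessed with a judge model or an entailment check rather than a ranking metric.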

How does the LLM-as-judge feature work?

It provides implementation patterns in which a more capable model (like Claude 3.5 Sonnet or GPT-4o) evaluates the output of a smaller or specialized model against custom criteria such as accuracy and clarity.
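A minimal sketch of the judge pattern follows. It deliberately avoids any vendor SDK: `call_judge` is a hypothetical callable that takes a prompt string and returns the judge model's raw text reply, so you would wrap whichever client you actually use. The prompt wording, the 1 to 5 scale, and the JSON reply format are assumptions, not the skill's own template.

```python
import json

# Hypothetical grading prompt; adjust wording and scale to your use case.
JUDGE_TEMPLATE = """You are grading a model response.
Question: {question}
Response: {response}
Rate the response from 1 to 5 on each criterion: {criteria}.
Reply with a JSON object mapping each criterion to an integer score."""


def judge_response(question, response, criteria, call_judge):
    """Ask a judge model to score a response on each criterion.

    call_judge: any function str -> str that sends the prompt to the
    judge model and returns its raw reply (assumed to be JSON).
    """
    prompt = JUDGE_TEMPLATE.format(
        question=question,
        response=response,
        criteria=", ".join(criteria),
    )
    scores = json.loads(call_judge(prompt))
    result = {}
    for criterion in criteria:
        value = int(scores[criterion])
        if not 1 <= value <= 5:
            raise ValueError(f"score for {criterion!r} out of range: {value}")
        result[criterion] = value
    return result
```

Keeping the model call behind a plain callable makes the parsing and validation logic testable with a stub, and lets the same code drive different judge models.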

What metrics does this skill support?

It supports automated metrics like BLEU and ROUGE for text overlap, semantic metrics like BERTScore, retrieval metrics like MRR and NDCG, and custom qualitative metrics via LLM-as-judge patterns.
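To give a feel for what a text-overlap metric measures, here is a simplified ROUGE-1-style unigram F1. It is a sketch only: it tokenizes on whitespace and lowercases, with no stemming or stopword handling, so its numbers will not match an official ROUGE implementation. The function name `unigram_f1` is hypothetical.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 over overlapping unigrams (a simplified ROUGE-1-style score)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Overlap scores like this reward shared wording rather than shared meaning, which is why they are usually paired with semantic metrics or a judge model.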

LLM Application Evaluation

Name: LLM Application Evaluation
Author: EricGrill

Category: Data Science and Machine Learning

Implements robust evaluation frameworks for AI applications using automated metrics, human feedback, and LLM-as-judge patterns.

This skill provides a comprehensive toolkit for assessing the performance and reliability of Large Language Model (LLM) applications. It enables developers to implement standardized evaluation strategies ranging from traditional NLP metrics like BLEU and ROUGE to modern LLM-as-judge patterns and human-in-the-loop annotation workflows. By establishing consistent baselines, detecting performance regressions, and facilitating statistical A/B testing, this skill helps teams build confidence in their production AI systems and systematically validate improvements to prompts, models, and retrieval strategies.
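As an illustration of the statistical A/B testing mentioned above, the sketch below compares per-example scores from two variants using a two-sided permutation test for significance and Cohen's d for effect size. It uses only the standard library. The function names, the default of 10,000 permutations, and the choice of a permutation test are assumptions, not a description of the skill's framework.

```python
import random
from statistics import mean, stdev


def cohens_d(a, b):
    """Effect size of b relative to a, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    if pooled_var == 0:
        return 0.0
    return (mean(b) - mean(a)) / pooled_var ** 0.5


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means."""
    rng = random.Random(seed)  # seeded so results are reproducible
    observed = abs(mean(b) - mean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[len(a):]) - mean(pooled[:len(a)]))
        # Small tolerance avoids float noise excluding exact ties.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```

Reporting the effect size alongside the p-value matters because a large evaluation set can make a trivially small improvement statistically significant.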

Key Features

01 LLM-as-Judge patterns for qualitative and pairwise assessments

02 RAG-specific evaluation metrics like MRR, NDCG, and groundedness

03 Statistical A/B testing framework with significance and effect size

04 Automated NLP metrics including BLEU, ROUGE, and BERTScore

05 Regression detection to prevent performance degradation

GitHub stars: 2

Use Cases

01 Validating prompt engineering changes before production deployment

02 Measuring the accuracy and retrieval quality of RAG-based systems

03 Comparing performance and cost-efficiency across different LLMs

Install with 🐟 Skill.Fish

npx skillfish add ericgrill/agents-skills-plugins llm-evaluation

For use in Claude.ai and ChatGPT