What automated metrics are included in this skill?

The skill supports a wide range of metrics including BLEU and ROUGE for text overlap, BERTScore for semantic similarity, and retrieval metrics like MRR and NDCG for RAG systems.
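As a rough illustration of the retrieval metrics named above, here is a minimal standard-library sketch of MRR and NDCG. The function names and input shapes are assumptions for this example, not the skill's actual API, and the ideal ranking for NDCG is taken from the retrieved list's own gains, which is a common simplification.

```python
import math

def mean_reciprocal_rank(ranked_relevance):
    """ranked_relevance: one list per query of 0/1 flags, in rank order."""
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(gains, k):
    """gains: graded relevance of the retrieved items, in rank order."""
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0
```

For example, `mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])` averages reciprocal ranks of 1/2 and 1, giving 0.75.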

How does the LLM-as-Judge pattern work?

It uses a highly capable model (like Claude 3.5 Sonnet or GPT-4) to grade the outputs of other models based on specific criteria like accuracy, helpfulness, and clarity using pointwise or pairwise comparison.
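A pointwise judge can be sketched as follows. This is an illustration only: `call_model` is a hypothetical stand-in for whatever client sends a prompt to the judge model and returns its text reply, and the prompt wording and 1 to 5 scale are assumptions rather than the skill's own rubric.

```python
import re

# Hypothetical prompt wording; adapt it to your own rubric.
JUDGE_TEMPLATE = (
    "You are grading an answer on {criterion}.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with a single integer from 1 (poor) to 5 (excellent)."
)

def pointwise_judge(question, answer, call_model,
                    criteria=("accuracy", "helpfulness", "clarity")):
    """Ask the judge model for one 1-5 score per criterion.

    call_model: any callable taking a prompt string and returning the
    judge's reply as a string (assumed; wire in your own client).
    """
    scores = {}
    for criterion in criteria:
        prompt = JUDGE_TEMPLATE.format(
            criterion=criterion, question=question, answer=answer
        )
        reply = call_model(prompt)
        # Take the first standalone digit 1-5; None if the reply has none.
        match = re.search(r"\b([1-5])\b", reply)
        scores[criterion] = int(match.group(1)) if match else None
    return scores

# Usage with a fake judge so the sketch runs without network access.
fake_judge = lambda prompt: "Score: 4"
example_scores = pointwise_judge("What is 2 + 2?", "4", fake_judge)
```

A pairwise variant would follow the same shape, showing the judge two answers and asking which one is better.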

Does it support human evaluation workflows?

Yes, it provides structures for human annotation tasks, guidelines for manual rating, and calculations for inter-rater agreement using Cohen's Kappa score.
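For reference, Cohen's Kappa compares the observed agreement between two raters with the agreement expected by chance. The sketch below is a plain standard-library version; the function name and label format are assumptions for illustration, not the skill's implementation.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

With ratings `["good", "good", "bad", "bad"]` and `["good", "bad", "bad", "bad"]`, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5.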

Can I detect if a new prompt version is worse than the previous one?

Yes, the skill includes a Regression Detector that compares new results against a baseline and flags significant decreases in performance based on a configurable threshold.
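The idea of comparing against a baseline with a threshold can be sketched like this. It is an assumption-laden illustration, not the skill's Regression Detector: `detect_regressions` is a hypothetical name, the threshold is read as a relative drop, and every metric is assumed to be higher-is-better.

```python
def detect_regressions(baseline, candidate, threshold=0.05):
    """Return metrics whose score fell by more than `threshold` (relative).

    baseline, candidate: dicts mapping metric name to score.
    Metrics missing from the candidate or with a zero baseline are skipped.
    """
    flagged = {}
    for metric, base_score in baseline.items():
        if metric not in candidate or base_score == 0:
            continue
        relative_change = (candidate[metric] - base_score) / abs(base_score)
        if relative_change < -threshold:
            flagged[metric] = relative_change
    return flagged

# Usage: accuracy drops 12.5% relative to baseline, ROUGE-L improves.
regressions = detect_regressions(
    {"accuracy": 0.80, "rouge_l": 0.50},
    {"accuracy": 0.70, "rouge_l": 0.51},
    threshold=0.05,
)
```

A fixed threshold like this does not account for sampling noise, so small test sets may still need a significance test on per-example scores.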

Is this suitable for testing RAG applications?

Absolutely. It contains specific metrics for retrieval quality (Precision@K) and generation quality (groundedness checks) essential for RAG development.
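As a small example of both sides, the sketch below computes Precision@K and a very crude lexical stand-in for a groundedness check. The names are hypothetical, and the token-overlap proxy is an assumption made for illustration: real groundedness checks typically rely on an LLM judge or an entailment model rather than word overlap.

```python
import re

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k

def token_overlap_groundedness(answer, context):
    """Crude proxy: share of answer tokens that also appear in the context."""
    answer_tokens = re.findall(r"\w+", answer.lower())
    if not answer_tokens:
        return 0.0
    context_tokens = set(re.findall(r"\w+", context.lower()))
    supported = sum(1 for tok in answer_tokens if tok in context_tokens)
    return supported / len(answer_tokens)
```

For instance, if two of the top three retrieved documents are relevant, `precision_at_k` returns 2/3.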

LLM Evaluation Suite

Name: LLM Evaluation Suite
Author: lifangda

Data Science and ML

Implements comprehensive evaluation frameworks for LLM applications using automated metrics, human feedback, and comparative benchmarking.

The LLM Evaluation skill provides a robust framework for measuring and improving the quality of Large Language Model applications. It enables developers to move beyond vibes-based testing by implementing standardized evaluation strategies ranging from classical NLP metrics like BLEU and ROUGE to modern approaches such as LLM-as-Judge and semantic similarity via BERTScore. By integrating automated testing, human annotation guidelines, and statistical A/B testing, this skill helps development teams identify performance regressions, validate prompt engineering improvements, and build production-ready AI systems with measurable confidence.

Key Features

01 Regression detection tools to identify performance drops across different model versions.

02 Specialized RAG metrics such as MRR, NDCG, and groundedness/faithfulness checks.

03 16 GitHub stars

04 LLM-as-Judge patterns for automated semantic assessment and pairwise comparisons.

05 Statistical A/B testing framework to measure improvement significance and effect size.

06 Automated NLP metrics including BLEU, ROUGE, and BERTScore for text generation quality.
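To make the A/B testing item concrete, here is one possible way to estimate significance and effect size for two sets of per-example scores using only the standard library. The choice of a permutation test and Cohen's d is an assumption for this sketch; the listing does not say which statistical tests the skill actually uses, and the function names are hypothetical.

```python
import math
import random
import statistics

def cohens_d(treatment, control):
    """Effect size: mean difference divided by the pooled standard deviation.

    Each group needs at least two scores for the sample variance.
    """
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * statistics.variance(treatment)
        + (n_c - 1) * statistics.variance(control)
    ) / (n_t + n_c - 2)
    if pooled_var == 0:
        return 0.0
    mean_diff = statistics.mean(treatment) - statistics.mean(control)
    return mean_diff / math.sqrt(pooled_var)

def permutation_p_value(treatment, control, n_permutations=2000, seed=0):
    """Two-sided p-value for the difference in means via label shuffling."""
    rng = random.Random(seed)  # seeded so repeated runs agree
    observed = abs(statistics.mean(treatment) - statistics.mean(control))
    pooled = list(treatment) + list(control)
    n_t = len(treatment)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_t]) - statistics.mean(pooled[n_t:]))
        if diff >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)
```

A small p-value suggests the difference is unlikely to be noise, while Cohen's d indicates how large the improvement is in standard-deviation units.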

Use Cases

01 Measuring the retrieval and generation quality of RAG-based knowledge systems.

02 Establishing quality baselines and tracking progress for AI agents and chatbots.

03 Validating prompt engineering changes and model upgrades before production release.

Install with 🐟 Skill.Fish

npx skillfish add lifangda/claude-plugins llm-evaluation

For use in Claude.ai and ChatGPT
