Is it possible to evaluate RAG systems with this skill?

Absolutely. It provides patterns for measuring groundedness, retrieval precision, and how well the model uses provided context.
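
As a rough illustration of two of these measurements, the sketch below computes precision@k for a retrieval step and a crude token-overlap proxy for groundedness. The function names `precision_at_k` and `context_overlap` are hypothetical and are not taken from the skill; the overlap score is only a stand-in, since a real groundedness check would more likely use an entailment model or an LLM judge.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved document IDs that are relevant.

    Hypothetical helper for illustration; not the skill's own API.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(retrieved)[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Dividing by k (not len(top_k)) penalizes retrievers that return too few results.
    return hits / k


def context_overlap(answer: str, context: str) -> float:
    """Crude groundedness proxy: share of answer tokens that also occur in the context."""
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return 0.0
    context_tokens = set(context.lower().split())
    return sum(tok in context_tokens for tok in answer_tokens) / len(answer_tokens)
```

Calling `precision_at_k(["a", "b", "c", "d"], {"a", "c"}, 4)` counts two relevant documents among four retrieved.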

Can I use this for regression testing?

Yes, the skill includes a RegressionDetector that compares current model results against a saved baseline to highlight performance drops.
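
The listing does not show the RegressionDetector interface, so the following is only a minimal sketch of the underlying idea: compare each metric against a stored baseline and flag drops larger than a tolerance. The function name `find_regressions`, the `tolerance` parameter, and the assumption that higher scores are better are all illustrative choices, not the skill's API.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return the names of metrics whose score fell by more than `tolerance`.

    Hypothetical helper; assumes every metric is "higher is better".
    Metrics missing from `current` are skipped rather than flagged.
    """
    regressions = []
    for name, base_score in baseline.items():
        if name not in current:
            continue
        if base_score - current[name] > tolerance:
            regressions.append(name)
    return regressions


baseline_scores = {"accuracy": 0.90, "rouge_l": 0.45}
current_scores = {"accuracy": 0.85, "rouge_l": 0.46}
flagged = find_regressions(baseline_scores, current_scores)
```

In a CI job, a non-empty `flagged` list could be used to fail the build before a change is deployed.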

How does LLM-as-Judge work?

It uses a highly capable model as a grader, scoring the outputs of smaller models against custom rubrics for criteria such as accuracy, helpfulness, and tone.
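
A pointwise version of this pattern might look like the sketch below. The judge is passed in as a plain callable so the example stays independent of any provider SDK; `build_judge_prompt`, `parse_score`, `grade`, and the `SCORE: n` reply format are all assumptions made for illustration, not the skill's actual prompts or API.

```python
import re
from typing import Callable, Optional


def build_judge_prompt(question: str, answer: str, rubric: str) -> str:
    """Assemble a grading prompt (hypothetical format) for the judge model."""
    return (
        "You are grading an assistant's answer.\n"
        f"Rubric: {rubric}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Explain briefly, then finish with a line 'SCORE: <integer from 1 to 5>'."
    )


def parse_score(judge_reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Extract the integer score; return None if it is missing or out of range."""
    match = re.search(r"SCORE:\s*(\d+)", judge_reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


def grade(
    question: str,
    answer: str,
    rubric: str,
    judge: Callable[[str], str],  # assumption: wraps whatever model client you use
) -> Optional[int]:
    """Run one pointwise judgment and return the parsed score."""
    return parse_score(judge(build_judge_prompt(question, answer, rubric)))
```

Returning `None` for unparseable replies keeps malformed judge output from being silently counted as a score.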

What metrics are included in the LLM evaluation skill?

The skill includes automated text metrics (BLEU, ROUGE, BERTScore), classification metrics (F1, accuracy), and retrieval ranking metrics for RAG (MRR, NDCG).
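
For readers unfamiliar with the two ranking metrics, here is a small standard-library sketch of their usual definitions. The function names and argument shapes are assumptions for illustration and do not reflect how the skill itself exposes them.

```python
import math
from typing import Dict, Sequence, Set


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[str]], relevant: Sequence[Set[str]]
) -> float:
    """Average of 1/rank of the first relevant document per query (0 if none found)."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)


def ndcg_at_k(ranked: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """NDCG@k with graded relevance `gains` and a log2 position discount."""

    def dcg(values: Sequence[float]) -> float:
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(doc_id, 0.0) for doc_id in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

MRR only looks at the first relevant hit per query, while NDCG rewards placing the most relevant documents highest across the whole top-k.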

Does it support human evaluation workflows?

Yes, it provides structures for human annotation tasks, rating scales, and tools to calculate inter-rater agreement using Cohen’s Kappa.
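
Cohen's kappa corrects raw agreement between two annotators for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below implements that formula with the standard library; `cohens_kappa` is a hypothetical name, not necessarily the one the skill uses.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


ratings_a = ["good", "good", "bad", "bad"]
ratings_b = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(ratings_a, ratings_b)
```

Here the raters agree on three of four items (p_o = 0.75) while chance agreement is 0.5, so kappa is 0.5.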

LLM Evaluation Suite

Name: LLM Evaluation Suite
Author: JantonioFC

Data Science & ML

Implements comprehensive evaluation frameworks for LLM applications using automated metrics, human feedback, and comparative benchmarking.

This skill provides a robust toolkit for measuring and improving LLM application quality through systematic testing and validation. It bridges the gap between raw model outputs and production-grade reliability by offering implementation patterns for standard NLP metrics like BLEU and BERTScore, advanced LLM-as-Judge patterns, and human annotation frameworks. Developers can use this skill to establish performance baselines, run A/B tests between different prompts, and detect regressions before deploying changes to production, ensuring data-driven improvements to AI behavior.

Key Features

01 RAG-specific evaluation metrics for retrieval and groundedness

02 Automated regression detection to track performance across versions

03 LLM-as-Judge patterns for automated pointwise and pairwise grading

04 Statistical A/B testing and Cohen's d effect size analysis

05 Automated NLP metrics including BLEU, ROUGE, and BERTScore

06 2 GitHub stars
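
To make the effect-size item concrete: Cohen's d expresses the difference between two variants' mean scores in units of their pooled standard deviation. The following is a minimal sketch using only the standard library; the name `cohens_d` and the choice of the pooled sample standard deviation are assumptions, and the skill may compute it differently.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Effect size of variant B relative to variant A (positive means B scored higher)."""
    n_a, n_b = len(scores_a), len(scores_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each variant needs at least two scores")
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(scores_a) + (n_b - 1) * variance(scores_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("scores have zero variance; effect size is undefined")
    return (mean(scores_b) - mean(scores_a)) / math.sqrt(pooled_var)


prompt_a_scores = [1.0, 2.0, 3.0]
prompt_b_scores = [2.0, 3.0, 4.0]
effect = cohens_d(prompt_a_scores, prompt_b_scores)
```

An effect size complements a significance test in an A/B comparison: it indicates how large the difference is, not just whether it is likely to be real.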

Use Cases

01 Benchmarking model performance across different prompt versions or LLM providers

02 Validating the accuracy and relevance of RAG-based search results

03 Establishing a systematic quality gate in CI/CD pipelines for AI agents

Install with 🐟 Skill.Fish

npx skillfish add jantoniofc/skillsbank llm-evaluation

For use in Claude.ai and ChatGPT
