Can I use one model to evaluate another?

Yes. The skill includes LLM-as-judge patterns for pointwise, pairwise, and reference-based evaluation, using a strong model such as Claude 3.5 Sonnet as the judge.
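
A minimal sketch of the pairwise pattern, assuming a hypothetical `judge` callable that sends a prompt to whichever model you use and returns its text reply. The template wording, the function names, and the position-swap step are illustrative and are not taken from the skill itself.

```python
from typing import Callable

# Hypothetical prompt template; adapt the wording to your own rubric.
PAIRWISE_TEMPLATE = """You are comparing two answers to the same question.
Question: {question}
Answer A: {a}
Answer B: {b}
Reply with exactly one of: A, B, TIE."""


def parse_verdict(raw: str) -> str:
    """Normalise the judge's reply to 'A', 'B', or 'TIE' (unknown replies become 'TIE')."""
    token = raw.strip().upper()
    return token if token in {"A", "B", "TIE"} else "TIE"


def pairwise_judge(question: str, first: str, second: str,
                   judge: Callable[[str], str]) -> str:
    """Return 'first', 'second', or 'tie'.

    The judge is asked twice with the answer order swapped. A winner is
    reported only when both orderings agree, which reduces position bias.
    """
    v1 = parse_verdict(judge(PAIRWISE_TEMPLATE.format(question=question, a=first, b=second)))
    v2 = parse_verdict(judge(PAIRWISE_TEMPLATE.format(question=question, a=second, b=first)))
    # Map each verdict back to the original candidates.
    r1 = {"A": "first", "B": "second", "TIE": "tie"}[v1]
    r2 = {"A": "second", "B": "first", "TIE": "tie"}[v2]
    return r1 if r1 == r2 else "tie"
```

A judge that always answers "A" regardless of content ends up as a tie under this scheme, because the two orderings disagree.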

Does it help with RAG evaluation?

Yes. It includes implementations for groundedness, factuality, and retrieval quality, so you can check that your RAG pipeline's answers are actually supported by the retrieved context.
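
To make the idea of groundedness concrete, here is a deliberately crude lexical proxy. It is an assumption for illustration, not the skill's implementation, which may well use an LLM judge or an entailment model instead. The function name and the 0.6 threshold are hypothetical.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness_proxy(answer: str, context: str, threshold: float = 0.6) -> float:
    """Fraction of answer sentences whose tokens mostly appear in the context.

    A sentence counts as supported when at least `threshold` of its distinct
    tokens also occur in the retrieved context. This only measures word
    overlap, so it can miss paraphrases and can be fooled by negations.
    """
    context_tokens = _tokens(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        toks = _tokens(sentence)
        if toks and len(toks & context_tokens) / len(toks) >= threshold:
            supported += 1
    return supported / len(sentences)


context = "The Eiffel Tower is in Paris. It was completed in 1889."
answer = "The Eiffel Tower is in Paris. It is made of chocolate."
print(groundedness_proxy(answer, context))  # 0.5: one of two sentences is supported
```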

How does it handle human feedback?

It provides structured templates for human annotation tasks and includes utility functions to calculate inter-rater agreement (Cohen's Kappa) to ensure scoring reliability.
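
For reference, Cohen's Kappa for two raters is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. A small self-contained sketch follows; the function name is hypothetical and does not reflect the skill's own utility.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa for two raters labelling the same items in the same order."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used one identical label for every item.
    return (observed - expected) / (1 - expected)


print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.5
```

In that example the raters agree on 3 of 4 items (p_o = 0.75) and chance agreement is 0.5, giving a Kappa of 0.5.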

What automated metrics are supported?

The skill supports standard text-overlap and similarity metrics such as BLEU, ROUGE, and BERTScore, classification metrics such as F1, and retrieval ranking metrics used for RAG, such as NDCG and MRR.
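
The two ranking metrics are short enough to sketch directly. The version below is a plain-Python illustration with hypothetical function names. It assumes linear gains for NDCG and computes the ideal ordering from the retrieved list only, so it does not account for relevant documents that were never retrieved.

```python
import math
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """MRR over queries; each inner sequence holds 0/1 relevance flags in ranked order."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, rel in enumerate(flags, start=1):
            if rel:
                total += 1.0 / rank  # Only the first relevant hit counts.
                break
    return total / len(ranked_relevance)


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gains and a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the same gains sorted best-first."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]]))  # 0.5
print(ndcg_at_k([3, 2, 1], k=3))  # 1.0, already in ideal order
```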

Is it suitable for production monitoring?

Yes, the framework is designed to be integrated into your development workflow to track performance over time and detect quality regressions before deployment.
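
One simple way to detect regressions is to compare the current run's scores with a stored baseline and flag any metric that drops by more than a tolerance. The sketch below assumes higher scores are better. The function name, metric names, and the 0.02 tolerance are hypothetical.

```python
def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     tolerance: float = 0.02) -> dict[str, float]:
    """Return {metric: drop} for metrics that fell by more than `tolerance`.

    Metrics missing from `current` are skipped rather than flagged.
    """
    regressions: dict[str, float] = {}
    for name, base_score in baseline.items():
        if name not in current:
            continue
        drop = base_score - current[name]
        if drop > tolerance:
            regressions[name] = drop
    return regressions


baseline = {"rouge_l": 0.42, "groundedness": 0.90}
current = {"rouge_l": 0.41, "groundedness": 0.80}
failed = find_regressions(baseline, current)
print(sorted(failed))  # ['groundedness']
```

In a CI job, a non-empty result would typically fail the build so the regression is caught before deployment.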

LLM Evaluation Suite

Name: LLM Evaluation Suite
Author: rajasekarm


Data Science & ML

Implements comprehensive evaluation strategies for Large Language Model applications using automated metrics, human feedback, and LLM-as-judge patterns.

The LLM Evaluation skill provides a robust framework for measuring and improving the performance of AI applications. It offers a standardized approach to benchmarking through automated metrics like BLEU and BERTScore, advanced LLM-as-judge evaluation patterns for qualitative assessment, and human-in-the-loop feedback systems. Whether you are comparing model iterations, detecting performance regressions, or validating RAG pipeline accuracy, this skill provides the tools necessary to ensure production-ready reliability for LLM-powered software.

Key Features

01 Human evaluation frameworks with inter-rater agreement calculations

02 Retrieval (RAG) performance tracking using MRR, NDCG, and Precision@K

03 0 GitHub stars

04 Automated text metrics including BLEU, ROUGE, and BERTScore

05 LLM-as-Judge patterns for qualitative scoring and pairwise comparison

06 Custom metric implementation for groundedness, factuality, and toxicity
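
Of the retrieval metrics listed above, Precision@K is the simplest: the share of the top K retrieved documents that are relevant. A minimal sketch with hypothetical names, dividing by K even when fewer than K documents were retrieved:

```python
from typing import Hashable, Sequence, Set


def precision_at_k(retrieved_ids: Sequence[Hashable],
                   relevant_ids: Set[Hashable],
                   k: int) -> float:
    """Fraction of the top-k retrieved documents that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be a positive integer.")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d9"}
print(precision_at_k(retrieved, relevant, k=3))  # 2 of the top 3 are relevant
```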

Use Cases

01 Measuring the accuracy and groundedness of RAG-based search systems

02 Comparing model performance across different prompts or model versions

03 Building automated CI/CD pipelines for LLM regression testing


Install with 🐟 Skill.Fish

npx skillfish add rajasekarm/skills llm-evaluation

For use in Claude.ai and ChatGPT
