About
This skill provides a toolkit for measuring and optimizing the performance of Large Language Model applications. It covers a range of evaluation strategies, from traditional NLP metrics such as BLEU and ROUGE to embedding-based measures such as BERTScore. It also includes implementation patterns for LLM-as-judge evaluation, automated regression detection, and statistical A/B testing, helping keep AI systems accurate, safe, and reliable. Whether you are validating a RAG pipeline or comparing prompt versions, this skill helps establish rigorous baselines and data-driven quality control.
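As a minimal illustration of the regression-detection pattern mentioned above, the sketch below scores candidate outputs against references with ROUGE-L and flags any drop beyond a tolerance relative to a stored baseline. It assumes the `rouge-score` package is installed; the baseline and tolerance values are hypothetical placeholders, not values shipped with this skill.

```python
# A minimal sketch of automated regression detection using ROUGE-L.
# Assumes: pip install rouge-score. The baseline and tolerance values
# below are hypothetical placeholders, not part of the skill itself.
from rouge_score import rouge_scorer

BASELINE_ROUGE_L = 0.42   # hypothetical score from a previous version
TOLERANCE = 0.02          # hypothetical allowed drop before flagging

def mean_rouge_l(predictions: list[str], references: list[str]) -> float:
    """Average ROUGE-L F1 over a paired evaluation set."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [
        scorer.score(ref, pred)["rougeL"].fmeasure
        for pred, ref in zip(predictions, references)
    ]
    return sum(scores) / len(scores)

def check_regression(predictions: list[str], references: list[str]) -> bool:
    """Return True if quality dropped more than TOLERANCE below baseline."""
    current = mean_rouge_l(predictions, references)
    regressed = current < BASELINE_ROUGE_L - TOLERANCE
    print(f"ROUGE-L: {current:.3f} (baseline {BASELINE_ROUGE_L:.3f}) "
          f"-> {'REGRESSION' if regressed else 'ok'}")
    return regressed

if __name__ == "__main__":
    preds = ["The cat sat on the mat.", "Paris is the capital of France."]
    refs = ["A cat was sitting on the mat.", "The capital of France is Paris."]
    check_regression(preds, refs)
```

The same gating structure applies to any of the metrics listed above: swap the scoring function for BERTScore or an LLM-as-judge rubric and keep the baseline comparison unchanged.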