How does it handle regression testing?

It includes a RegressionDetector that compares new results against established baselines and flags any significant performance decreases based on custom thresholds.
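As a rough illustration of threshold-based regression detection, the sketch below compares a new set of scores against a baseline and flags any metric whose relative drop exceeds its threshold. The function `detect_regressions`, the `Regression` record, and the 5% default threshold are hypothetical names and values chosen for this example; they are not the skill's actual RegressionDetector API, and the sketch assumes higher is better for every metric.

```python
from dataclasses import dataclass


@dataclass
class Regression:
    metric: str
    baseline: float
    current: float
    drop: float  # relative decrease, e.g. 0.10 means 10% lower than baseline


def detect_regressions(baseline, current, thresholds, default_threshold=0.05):
    """Return metrics whose relative drop from baseline exceeds their threshold.

    Hypothetical helper: assumes higher values are better for every metric.
    """
    regressions = []
    for metric, base_value in baseline.items():
        # Skip metrics that were not re-measured or have no usable baseline.
        if metric not in current or base_value <= 0:
            continue
        drop = (base_value - current[metric]) / base_value
        if drop > thresholds.get(metric, default_threshold):
            regressions.append(Regression(metric, base_value, current[metric], drop))
    return regressions


baseline = {"accuracy": 0.90, "rouge_l": 0.50}
current = {"accuracy": 0.80, "rouge_l": 0.49}
flagged = detect_regressions(baseline, current, {"accuracy": 0.05, "rouge_l": 0.05})
for r in flagged:
    print(f"{r.metric}: {r.baseline:.2f} -> {r.current:.2f} ({r.drop:.1%} drop)")
```

In a CI/CD job, a non-empty result from a check like this would typically fail the build.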

What automated metrics does this skill support?

It supports a wide range of metrics including BLEU, ROUGE, METEOR, BERTScore for semantic similarity, and standard classification metrics like Precision, Recall, and F1.
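For the classification metrics, the definitions are simple enough to sketch directly. The following is a minimal standard-library illustration of Precision, Recall, and F1 for a binary task; the function name `precision_recall_f1` is made up for this example and is not taken from the skill.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```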

What is the 'LLM-as-judge' pattern?

This pattern uses a highly capable model to evaluate the outputs of other models, allowing for automated grading of subjective qualities like helpfulness, coherence, and safety.
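A pointwise version of this pattern can be sketched as follows. The prompt template, the 1 to 5 scale, the `judge_pointwise` function, and the `call_judge` callable are all assumptions made for illustration; `call_judge` stands in for whatever client sends a prompt to the judge model, and a canned `fake_judge` is used here so the example runs without network access.

```python
import re
from typing import Callable

# Hypothetical grading prompt; a real rubric would likely be more detailed.
JUDGE_TEMPLATE = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer's {criterion} from 1 (poor) to 5 (excellent).
Reply in the form "Score: <number>" followed by a one-sentence reason."""


def judge_pointwise(
    question: str,
    answer: str,
    criterion: str,
    call_judge: Callable[[str], str],
) -> int:
    """Ask a judge model for a 1-5 score on one criterion and parse the reply."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, answer=answer, criterion=criterion
    )
    reply = call_judge(prompt)
    match = re.search(r"Score:\s*([1-5])\b", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from judge reply: {reply!r}")
    return int(match.group(1))


def fake_judge(prompt: str) -> str:
    # Stand-in for a real model call, used only to keep the sketch runnable.
    return "Score: 4 The answer is accurate but omits an example."


score = judge_pointwise(
    "What is overfitting?",
    "When a model memorizes training data and generalizes poorly.",
    "helpfulness",
    fake_judge,
)
print(score)
```

Parsing a fixed "Score:" format and raising on anything else keeps malformed judge replies from silently turning into scores.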

Can I use this for RAG (Retrieval-Augmented Generation) apps?

Yes, it includes specialized metrics for retrieval quality such as MRR, NDCG, and Precision@K, as well as NLI-based groundedness checks to detect hallucinations.
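The retrieval metrics named above have standard definitions, sketched here with the standard library only. The function names and the toy document IDs are illustrative assumptions, not the skill's interface, and the NLI-based groundedness check is not covered because it requires a trained model.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """queries: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k


def ndcg_at_k(ranked_ids, gains, k):
    """gains maps doc_id -> graded relevance; unlisted documents count as 0."""
    dcg = sum(
        gains.get(d, 0.0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["d3", "d1", "d7"]
relevant = {"d1", "d2"}
gains = {"d1": 3.0, "d2": 2.0}

mrr = mean_reciprocal_rank([(ranked, relevant), (["d2", "d5"], {"d2"})])
p_at_3 = precision_at_k(ranked, relevant, 3)
ndcg = ndcg_at_k(ranked, gains, 3)
print(f"MRR={mrr:.3f} P@3={p_at_3:.3f} NDCG@3={ndcg:.3f}")
```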

LLM Evaluation Framework

Author: HermeticOrmus


Data Science & Machine Learning

Implements comprehensive evaluation strategies for Large Language Model applications using automated metrics, human feedback, and comparative benchmarking.

The LLM Evaluation skill provides a robust framework for measuring and improving the performance of AI applications. It enables developers to implement systematic testing across multiple dimensions, including automated text metrics like BERTScore and ROUGE, LLM-as-judge patterns for qualitative assessment, and rigorous statistical A/B testing. By providing standardized tools for regression detection and RAG-specific retrieval metrics, this skill helps teams build confidence in their production AI systems and validate improvements from prompt engineering or model fine-tuning.

Key Features

01 LLM-as-judge evaluation patterns for pointwise and pairwise scoring

02 Automated regression detection to prevent performance drops in CI/CD

03 0 GitHub stars

04 Automated metric computation including BLEU, ROUGE, and BERTScore

05 RAG-specific metrics for retrieval quality and groundedness

06 Statistical A/B testing framework with Cohen's d effect size analysis
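To make the Cohen's d effect size mentioned in the feature list concrete, here is a minimal sketch using the pooled standard deviation of two independent samples. The function name `cohens_d` and the sample scores are invented for illustration and do not come from the skill.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Effect size of variant b relative to variant a, using the pooled SD.

    Each sample needs at least two observations.
    """
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (
        n_a + n_b - 2
    )
    if pooled_var == 0:
        raise ValueError("Both samples have zero variance; d is undefined.")
    return (mean(b) - mean(a)) / math.sqrt(pooled_var)


# Made-up per-run quality scores for two prompt variants.
scores_a = [0.70, 0.72, 0.68, 0.71, 0.69]
scores_b = [0.75, 0.77, 0.73, 0.76, 0.74]

d = cohens_d(scores_a, scores_b)
print(f"Cohen's d = {d:.2f}")
```

By a common rule of thumb, values around 0.2, 0.5, and 0.8 are read as small, medium, and large effects; an effect size complements a significance test rather than replacing it.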

Use Cases

01 Comparing different model providers or prompt versions systematically

02 Measuring the factual accuracy and groundedness of RAG systems

03 Automating quality assurance in the AI development lifecycle


Install with 🐟 Skill.Fish

npx skillfish add hermeticormus/floreserlife llm-evaluation
