Which automated metrics are supported?

The skill supports a wide range of metrics including BLEU and ROUGE for text similarity, BERTScore for semantic meaning, and RAG-specific metrics like MRR and NDCG.
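As a rough illustration of the two retrieval metrics named above, the following is a minimal standard-library sketch of MRR and NDCG@k. The function names and input shapes (`ranked_ids`, `relevant_ids`, `gain_by_id`) are assumptions made for this example and are not taken from the skill itself.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over queries.

    runs: list of (ranked_ids, relevant_ids) pairs, one pair per query.
    """
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def dcg(gains):
    """Discounted cumulative gain with the usual log2(position + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids, gain_by_id, k):
    """NDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    gains = [gain_by_id.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(gain_by_id.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

MRR only rewards the position of the first relevant hit, while NDCG accounts for graded relevance across the whole top-k list, so the two are usually reported together.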

How does the LLM-as-Judge pattern work?

It uses a stronger model (such as GPT-4 or Claude 3.5 Sonnet) as a grader, either comparing model outputs against a gold-standard reference or scoring them against specific criteria such as helpfulness and safety.
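A pointwise version of this pattern might look like the sketch below. The prompt wording, the 1 to 5 scale, and the `call_model` callable are all assumptions for illustration; `call_model` stands in for whatever client sends a prompt to the judge model and returns its text reply, and `fake_judge` is a stub so the example runs offline.

```python
import re

# Hypothetical prompt template; the skill's actual prompts are not shown in the source.
JUDGE_TEMPLATE = """You are grading an answer.
Criterion: {criterion}
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with a single line of the form "Score: N" where N is an integer from 1 to 5."""


def judge_pointwise(call_model, question, candidate, reference,
                    criterion="helpfulness"):
    """Ask a judge model for a 1-5 score and parse it from the reply.

    call_model: any callable taking a prompt string and returning the reply text.
    """
    prompt = JUDGE_TEMPLATE.format(
        criterion=criterion,
        question=question,
        reference=reference,
        candidate=candidate,
    )
    reply = call_model(prompt)
    match = re.search(r"Score:\s*([1-5])\b", reply)
    if match is None:
        # Refuse to guess when the judge ignores the requested format.
        raise ValueError(f"Could not parse a score from judge reply: {reply!r}")
    return int(match.group(1))


def fake_judge(prompt):
    """Offline stub standing in for a real judge model."""
    return "Score: 4"
```

Passing the model call in as a parameter keeps the grading logic testable without network access. A pairwise variant would present two candidates and ask which is better, ideally running each pair in both orders to reduce position bias.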

Does it support statistical analysis?

Yes, it includes an A/B testing framework that runs t-tests and reports p-values to determine whether improvements are statistically significant, along with Cohen's d to measure the size of the effect.
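To make the two statistics concrete, here is a small standard-library sketch that computes Welch's t statistic and Cohen's d for two sets of per-example scores. The function names are assumptions for this example. The p-value is left out because it requires the t distribution, which the standard library does not provide; one option would be `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Effect size: difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / math.sqrt(pooled_var)


def welch_t(a, b):
    """Welch's t statistic, which does not assume equal variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error


# Illustrative scores only, not real evaluation results.
baseline_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
variant_scores = [2.0, 3.0, 4.0, 5.0, 6.0]
```

Reporting both matters: with enough samples a tiny improvement can reach statistical significance, and Cohen's d shows whether the difference is large enough to care about.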

Can this skill help with RAG (Retrieval-Augmented Generation)?

Yes, it includes specialized metrics for retrieval performance and groundedness checks to ensure the model is correctly using provided context.
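The source does not say how groundedness is computed. As one very simple stand-in, the sketch below scores an answer by the fraction of its sentences whose words mostly appear in the retrieved context. The function name, the 0.6 threshold, and the lexical-overlap approach are all assumptions; a production check would more likely use an entailment model or an LLM judge.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer, context, threshold=0.6):
    """Fraction of answer sentences with at least `threshold` token overlap
    against the context. A crude lexical proxy, not semantic verification."""
    context_tokens = _tokens(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        sentence_tokens = _tokens(sentence)
        if not sentence_tokens:
            continue
        overlap = len(sentence_tokens & context_tokens) / len(sentence_tokens)
        if overlap >= threshold:
            supported += 1
    return supported / len(sentences)
```

Token overlap cannot detect paraphrase or negation, so a score from this sketch is best treated as a cheap first-pass filter.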

LLM Performance Evaluation

Name: LLM Performance Evaluation
Author: HermeticOrmus


Data Science & Machine Learning

Implements comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and comparative benchmarking.

The LLM Performance Evaluation skill provides a robust framework for assessing the quality, reliability, and accuracy of Large Language Model outputs. It equips developers with tools for automated NLP metrics like BLEU and BERTScore, advanced 'LLM-as-judge' patterns for semantic grading, and structured human annotation workflows. Whether you are comparing model variants, validating prompt engineering changes, or setting up regression testing in a CI/CD pipeline, this skill ensures your AI applications meet production-grade standards through rigorous statistical analysis and multi-dimensional scoring.

Key Features

01 LLM-as-Judge patterns for pointwise and pairwise evaluation

02 Regression detection to prevent performance drift in updates

03 Automated NLP metrics including BLEU, ROUGE, and BERTScore

04 Retrieval-specific metrics for RAG systems like MRR and NDCG

05 Statistical A/B testing framework with Cohen's d effect size

Use Cases

01 Measuring the impact of prompt engineering on output quality

02 Benchmarking different foundation models for specific use cases

03 Establishing automated quality gates for AI deployment pipelines
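A quality gate of the kind described in the last use case could be as simple as comparing a candidate's metrics against a stored baseline and failing the pipeline on any drop. The sketch below assumes every metric is "higher is better" and uses a hypothetical `find_regressions` helper with an absolute tolerance; the metric names and values are illustrative only.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return metrics where the candidate is missing or worse than the
    baseline by more than `tolerance`. Assumes higher scores are better.

    Result maps metric name -> (baseline_value, candidate_value_or_None).
    """
    regressions = {}
    for name, base_value in baseline.items():
        new_value = candidate.get(name)
        if new_value is None or new_value < base_value - tolerance:
            regressions[name] = (base_value, new_value)
    return regressions


# Illustrative numbers only, not real evaluation results.
baseline_metrics = {"rouge_l": 0.42, "groundedness": 0.90}
candidate_metrics = {"rouge_l": 0.44, "groundedness": 0.85}
failed = find_regressions(baseline_metrics, candidate_metrics)
```

In a CI job, a non-empty `failed` dictionary would typically be printed and followed by a non-zero exit code so the deployment step is blocked.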


Install with 🐟 Skill.Fish

npx skillfish add hermeticormus/alqvimia-contador llm-evaluation
