What is the LLM-as-judge approach?

LLM-as-judge uses a capable model (for example, Claude 3.5 Sonnet) to grade the outputs of other models against nuanced criteria such as helpfulness, safety, and tone.
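
As a rough illustration of a pointwise judge, the sketch below builds a grading prompt and parses a 1-5 score from the reply. The prompt wording, the `SCORE:` reply format, and the `call_judge` callable are all assumptions made for this example, not the skill's actual implementation; in practice `call_judge` would wrap whatever model client you use.

```python
import re
from typing import Callable

# Hypothetical prompt template; the "SCORE:" reply format is an assumption.
JUDGE_PROMPT = """You are grading a model response.
Criterion: {criterion}
Question: {question}
Response: {response}
Reply with a line of the form "SCORE: <integer 1-5>" followed by a short reason."""


def judge_pointwise(question: str, response: str, criterion: str,
                    call_judge: Callable[[str], str]) -> int:
    """Ask a judge model for a 1-5 score and parse it out of the reply."""
    prompt = JUDGE_PROMPT.format(
        criterion=criterion, question=question, response=response
    )
    reply = call_judge(prompt)
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        raise ValueError(f"could not parse a score from judge reply: {reply!r}")
    return int(match.group(1))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "SCORE: 4\nClear and mostly complete."


score = judge_pointwise(
    "What is BLEU?", "An n-gram overlap metric.", "helpfulness", fake_judge
)
```

Passing the judge in as a callable keeps the parsing logic testable without network access.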

Which automated metrics are supported for text generation?

The skill includes implementation patterns for BLEU, ROUGE, METEOR, and BERTScore to measure similarity and quality against reference texts.
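
To make the idea of reference-based overlap concrete, here is a minimal pure-Python sketch of clipped n-gram precision (the building block of BLEU) and a ROUGE-N F1 score. It assumes whitespace tokenization and lowercasing, and it omits BLEU's brevity penalty and multi-order averaging; it is not the skill's own code, and production use would normally rely on an established metrics library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Share of candidate n-grams that also occur in the reference (clipped)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count per n-gram (clipping).
    return sum((cand & ref).values()) / total


def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """F1 of n-gram overlap between candidate and reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
unigram_f1 = rouge_n_f1(candidate, reference, n=1)
bigram_precision = clipped_precision(candidate, reference, n=2)
```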

Can I use this skill to evaluate RAG applications?

Yes, it provides specific metrics for Retrieval-Augmented Generation, including Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and custom groundedness checks that verify responses are based on the retrieved context.
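
The two retrieval metrics named above can be sketched in a few lines. This example assumes binary relevance flags for MRR and graded gains with the linear-gain DCG variant for NDCG; the function names are illustrative and not taken from the skill.

```python
import math


def mean_reciprocal_rank(ranked_relevance) -> float:
    """ranked_relevance: one list of 0/1 flags per query, in ranked order."""
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


def dcg(gains, k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


# Three queries: first relevant hit at rank 2, at rank 1, and never.
mrr = mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
perfect = ndcg_at_k([3, 2, 1], k=3)
swapped = ndcg_at_k([1, 3], k=2)
```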

Does it support statistical analysis for A/B testing?

Yes, it includes a testing framework that calculates p-values to check whether an improvement in a new model version is statistically significant, and Cohen's d to measure how large the effect is.
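
The sketch below shows one way to compute both quantities for two sets of per-example scores using only the standard library. Note the substitution: it uses a two-sided permutation test for the p-value rather than whatever test the skill itself implements, which the page does not specify. The sample scores are made up for illustration.

```python
import math
import random
import statistics


def cohens_d(a, b) -> float:
    """Standardized mean difference (b minus a) using the pooled std dev."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    if pooled_var == 0:
        return 0.0
    return (statistics.mean(b) - statistics.mean(a)) / math.sqrt(pooled_var)


def permutation_p_value(a, b, n_resamples: int = 2000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(b) - statistics.mean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(
            statistics.mean(pooled[len(a):]) - statistics.mean(pooled[:len(a)])
        )
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Illustrative (made-up) per-example scores for two prompt variants.
variant_a = [0.60, 0.62, 0.58, 0.61, 0.59, 0.63, 0.60, 0.62]
variant_b = [0.70, 0.72, 0.69, 0.71, 0.73, 0.68, 0.70, 0.72]
effect = cohens_d(variant_a, variant_b)
p_value = permutation_p_value(variant_a, variant_b)
```

Reporting the effect size alongside the p-value matters because a tiny improvement can be statistically significant on a large evaluation set while being too small to act on.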

How does it handle regression testing?

The skill includes a Regression Detector that compares new evaluation results against a stored baseline and flags significant performance decreases.
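
A baseline comparison of this kind might look like the following. This is a simplified sketch, not the skill's Regression Detector: it assumes every metric is higher-is-better and flags a regression when the relative drop exceeds a fixed threshold, rather than applying a statistical test. The `Regression` dataclass, `detect_regressions`, and `max_relative_drop` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Regression:
    metric: str
    baseline: float
    current: float
    relative_drop: float


def detect_regressions(baseline: dict, current: dict,
                       max_relative_drop: float = 0.05) -> list:
    """Flag metrics whose value fell by more than max_relative_drop.

    Assumes higher is better for every metric; metrics missing from
    `current` or with a non-positive baseline are skipped.
    """
    regressions = []
    for metric, base_value in baseline.items():
        if metric not in current or base_value <= 0:
            continue
        drop = (base_value - current[metric]) / base_value
        if drop > max_relative_drop:
            regressions.append(
                Regression(metric, base_value, current[metric], drop)
            )
    return regressions


stored_baseline = {"accuracy": 0.90, "groundedness": 0.80}
new_results = {"accuracy": 0.89, "groundedness": 0.70}
flagged = detect_regressions(stored_baseline, new_results)
```

In a CI pipeline, a non-empty result would typically fail the build so the drop is reviewed before deployment.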

LLM Performance Evaluation

Name: LLM Performance Evaluation
Author: zsutxz


Data Science and ML

Implement comprehensive evaluation frameworks for AI applications using automated metrics, human feedback, and LLM-as-judge patterns.

This skill provides a structured framework for measuring and improving the quality of Large Language Model applications through rigorous testing methodologies. It equips developers with the tools to implement automated text metrics like BLEU and BERTScore, establish human annotation workflows, and utilize advanced 'LLM-as-judge' patterns for semantic quality assessment. Whether you are comparing model versions, detecting performance regressions before deployment, or running A/B tests on prompts, this skill ensures your AI systems remain reliable, factually accurate, and performant over time.

Key Features

01 Human evaluation workflows with inter-rater agreement (Cohen's Kappa) tracking

02 Automated metrics integration including BLEU, ROUGE, and BERTScore

03 LLM-as-judge patterns for pointwise and pairwise qualitative assessment

04 Regression detection to prevent performance drops during model or prompt updates

05 Statistical A/B testing framework with Cohen's d effect size analysis
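
For the inter-rater agreement mentioned in the feature list, Cohen's Kappa corrects raw agreement between two annotators for the agreement expected by chance. The following is a minimal sketch for two raters and categorical labels; the function name and sample labels are illustrative, not the skill's code.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b) -> float:
    """Cohen's Kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["good", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the raters agree on 3 of 4 items (0.75) against a chance agreement of 0.5, giving a kappa of 0.5.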

Use Cases

01 Benchmarking RAG system performance using retrieval and groundedness metrics

02 Validating prompt engineering improvements with statistical significance

03 Establishing automated CI/CD evaluation pipelines to catch model hallucinations


Install with 🐟 Skill.Fish

npx skillfish add zsutxz/claudelearning llm-evaluation
