Human evaluation frameworks with inter-rater agreement (Cohen's Kappa) calculations
LLM-as-judge patterns for pointwise, pairwise, and reference-based comparisons
Custom metric development for groundedness, toxicity, and factuality detection
Automated NLP metrics implementation including BLEU, ROUGE, and BERTScore
RAG performance measurement with MRR, NDCG, and Precision@K
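The inter-rater agreement item above can be sketched with a minimal, self-contained Cohen's Kappa implementation. This is an illustrative example, not the repository's actual code; the function name and label values are hypothetical.

```python
# Hypothetical sketch: Cohen's Kappa for two raters labeling the same items.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two equal-length label sequences, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a, "need matched, non-empty ratings"
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b.get(label, 0) for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two annotators judging four model outputs as pass/fail.
print(cohens_kappa(["pass", "pass", "fail", "pass"],
                   ["pass", "fail", "fail", "pass"]))  # → 0.5
```

Values above ~0.6 are conventionally read as substantial agreement; near 0 means agreement is no better than chance.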
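The RAG retrieval metrics listed above (MRR and Precision@K) can be illustrated with a short sketch. The function names and document IDs here are assumptions for illustration, not the repository's API.

```python
# Hypothetical sketch of two standard retrieval metrics for RAG evaluation.

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved document IDs that are relevant."""
    return sum(doc in relevant for doc in retrieved[:k]) / k

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant document across queries."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)

# Query 1: first relevant doc at rank 2; Query 2: at rank 3.
print(precision_at_k(["d1", "d2", "d3", "d4"], {"d1", "d3"}, k=3))  # → 2/3
print(mean_reciprocal_rank([["d2", "d1"], ["d5", "d6", "d3"]],
                           [{"d1"}, {"d3"}]))  # → (1/2 + 1/3) / 2
```

NDCG additionally discounts gains logarithmically by rank, rewarding relevant documents that appear earlier in the list.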