What types of metrics does this skill provide?

The skill supports quantitative metrics such as Exact Match, Token Overlap (Precision, Recall, F1), and Semantic Similarity, as well as qualitative evaluation through custom LLM-based scoring.
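
As a rough illustration of the two simplest metrics named here, the following is a minimal sketch in Python. It is not the skill's own code: the function names `normalize`, `exact_match`, and `token_overlap` are hypothetical, and tokenization is assumed to be lowercase whitespace splitting.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, very simple tokenizer)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings produce the same token sequence."""
    return normalize(prediction) == normalize(reference)


def token_overlap(prediction: str, reference: str) -> dict[str, float]:
    """Token-level precision, recall, and F1 using multiset overlap."""
    pred_counts = Counter(normalize(prediction))
    ref_counts = Counter(normalize(reference))
    # Counter & Counter keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, `token_overlap("the cat sat", "the cat sat down")` scores every predicted token as correct (precision 1.0) but misses one reference token (recall 0.75).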

Can I run A/B tests with this skill?

Yes, it includes a dedicated A/B testing framework that allows you to define model variants, distribute traffic, and compare their performance results side-by-side.
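
One way such a framework could be structured is sketched below. It assumes each variant carries a traffic weight and that requests are assigned by hashing a request ID, so the same ID always lands on the same variant. `Variant`, `assign_variant`, and `summarize` are hypothetical names, not the skill's API.

```python
import hashlib
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional


@dataclass
class Variant:
    name: str
    weight: float                      # relative share of traffic
    scores: list[float] = field(default_factory=list)


def assign_variant(variants: list[Variant], request_id: str) -> Variant:
    """Pick a variant deterministically, in proportion to its weight."""
    total = sum(v.weight for v in variants)
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a point in [0, total).
    point = (int(digest[:8], 16) / 0x100000000) * total
    cumulative = 0.0
    for variant in variants:
        cumulative += variant.weight
        if point < cumulative:
            return variant
    return variants[-1]  # guard against floating-point rounding


def summarize(variants: list[Variant]) -> dict[str, dict[str, Optional[float]]]:
    """Side-by-side sample count and mean score per variant."""
    return {
        v.name: {"n": len(v.scores), "mean_score": mean(v.scores) if v.scores else None}
        for v in variants
    }
```

Hash-based assignment is chosen here over random sampling so that reruns of the same test set are reproducible; a real deployment might prefer a different strategy.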

How does the LLM-as-judge pattern work?

It uses a separate, typically more capable language model to evaluate the output of another model against defined criteria such as accuracy, relevance, and clarity, returning both a numerical score and an explanation.
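
The pattern can be sketched as follows. This is an illustration under stated assumptions, not the skill's implementation: the judge is passed in as a plain callable (`call_model`, hypothetical) so no specific provider API is implied, the 1 to 5 scale is an assumption, and the judge is assumed to reply with JSON.

```python
import json
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Judgment:
    score: float        # numeric score on an assumed 1-5 scale
    explanation: str    # the judge's stated reasoning


def build_judge_prompt(question: str, answer: str, criteria: Sequence[str]) -> str:
    """Assemble the grading instructions sent to the judge model."""
    criteria_text = ", ".join(criteria)
    return (
        "You are grading a model's answer.\n"
        f"Criteria: {criteria_text}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        'Reply with JSON only: {"score": <1-5>, "explanation": "<one sentence>"}'
    )


def judge(
    question: str,
    answer: str,
    call_model: Callable[[str], str],
    criteria: Sequence[str] = ("accuracy", "relevance", "clarity"),
) -> Judgment:
    """Ask the judge model for a score and parse its JSON reply."""
    raw = call_model(build_judge_prompt(question, answer, criteria))
    data = json.loads(raw)
    score = float(data["score"])
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"score out of range: {score}")
    return Judgment(score=score, explanation=str(data["explanation"]))
```

In tests, `call_model` can be a stub that returns a fixed JSON string, which keeps the parsing logic verifiable without any network call.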

Does it support data validation for test sets?

Absolutely. It uses Pydantic models to ensure evaluation datasets are structured correctly with inputs, expected outputs, and metadata for reproducible testing.
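
A dataset schema along these lines might look like the sketch below. The model and field names (`EvalExample`, `EvalDataset`, `expected_output`, and so on) are assumptions for illustration; the skill's actual models may differ.

```python
from typing import Any, Dict, List

from pydantic import BaseModel, Field, ValidationError


class EvalExample(BaseModel):
    """One test case: an input, its expected output, and free-form metadata."""
    input: str = Field(min_length=1)
    expected_output: str = Field(min_length=1)
    metadata: Dict[str, Any] = Field(default_factory=dict)


class EvalDataset(BaseModel):
    """A named, versioned collection of examples (fields are hypothetical)."""
    name: str
    version: str = "1"
    examples: List[EvalExample]


def load_dataset(raw: Dict[str, Any]) -> EvalDataset:
    """Validate a raw dict; raises pydantic.ValidationError on bad structure."""
    return EvalDataset(**raw)
```

A record that omits `expected_output`, or supplies an empty string for it, is rejected at load time, so malformed test cases fail before any model is called.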

LLM Evaluation Metrics

Name: LLM Evaluation Metrics
Author: ricardoroche

Data Science and ML

Standardizes LLM performance evaluation through automated datasets, A/B testing, and advanced LLM-as-judge patterns.

The LLM Evaluation Metrics skill provides a rigorous framework for assessing AI model performance across diverse tasks. It enables developers to transition from anecdotal testing to data-driven optimization by providing structured patterns for evaluation datasets, quantitative metrics (like F1 and semantic similarity), and qualitative 'LLM-as-judge' scoring. With built-in support for A/B testing and experiment tracking, this skill ensures that model updates are validated against ground-truth references and specific performance benchmarks before deployment.
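
Semantic similarity is commonly computed as the cosine similarity between embedding vectors of the prediction and the reference. The sketch below assumes the embeddings have already been produced by some model (not shown, and not specified by the source); `cosine_similarity` is a hypothetical helper name.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```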

Key Features

01 Structured evaluation dataset management with Pydantic validation

02 Comprehensive metric suite including Exact Match, Token Overlap, and Semantic Similarity

03 Automated LLM-as-judge implementation for qualitative scoring and reasoning

04 Async batch processing for efficient large-scale experiment tracking

05 Robust A/B testing framework to compare model variants and traffic weights

06 0 GitHub stars
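
The async batch processing item in the list above could be approached as in this sketch, which assumes each example is scored by an async function and caps the number of concurrent calls with a semaphore. `evaluate_batch` and `evaluate_one` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def evaluate_batch(
    inputs: Sequence[str],
    evaluate_one: Callable[[str], Awaitable[float]],
    max_concurrency: int = 8,
) -> list[float]:
    """Score every input concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: str) -> float:
        # The semaphore keeps the number of in-flight calls bounded.
        async with semaphore:
            return await evaluate_one(item)

    # gather returns results in the same order as the inputs.
    return list(await asyncio.gather(*(guarded(item) for item in inputs)))
```

Bounding concurrency matters in practice because model providers usually enforce rate limits; the default of 8 is an arbitrary placeholder.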

Use Cases

01 Automating the quality assessment of long-form text generation using an LLM judge

02 Benchmarking new prompt versions against a golden test set to prevent regressions

03 Comparing the accuracy and latency of different model providers using A/B testing
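
For the regression-prevention use case, a simple gate might compare mean scores on the golden test set before and after a prompt change. The sketch below is one possible rule, not the skill's; `passes_regression_gate` and the `max_drop` tolerance are assumptions.

```python
from statistics import mean
from typing import Sequence


def passes_regression_gate(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    max_drop: float = 0.01,
) -> bool:
    """True if the candidate's mean score is within max_drop of the baseline."""
    if not baseline_scores or not candidate_scores:
        raise ValueError("both score lists must be non-empty")
    return mean(candidate_scores) >= mean(baseline_scores) - max_drop
```

A mean-based threshold ignores variance, so a small golden set may call for a stricter check such as a per-example comparison.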

Install with 🐟 Skill.Fish

npx skillfish add ricardoroche/ricardos-claude-code evaluation-metrics

For use in Claude.ai and ChatGPT
