Overview
This skill provides a framework for assessing the quality and performance of LLM applications at multiple layers, from surface-level string overlap to semantic similarity and model-based judgment. It covers a wide spectrum of evaluation techniques, including traditional linguistic metrics like BLEU and ROUGE, semantic evaluation using BERTScore, and modern LLM-as-judge patterns. Whether you are detecting performance regressions in CI/CD, comparing model variants through statistical A/B testing, or establishing human-in-the-loop annotation workflows, this skill helps ensure AI outputs remain accurate, safe, and helpful throughout the development lifecycle.
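As a small taste of the lexical-overlap end of that spectrum, the sketch below scores a single candidate against a reference with sentence-level BLEU and ROUGE-1/ROUGE-L. It assumes the `nltk` and `rouge-score` packages, which are common choices for these metrics but not required by this skill; semantic and LLM-as-judge evaluation are covered in later sections.

```python
# Minimal sketch: lexical-overlap metrics for one reference/candidate pair.
# Assumes: pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The quick brown fox jumps over the lazy dog."
candidate = "A quick brown fox jumped over a lazy dog."

# BLEU operates on token lists; smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE compares raw strings and reports precision/recall/F1 per variant.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU:    {bleu:.3f}")
print(f"ROUGE-1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L: {rouge['rougeL'].fmeasure:.3f}")
```

Scores like these are cheap and deterministic, which makes them useful as a first regression signal in CI/CD, but they only capture surface overlap; semantically equivalent rewordings may score poorly, which is why the later layers exist.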