What automated metrics are included for text generation?

The skill includes implementations for BLEU (translation), ROUGE (summarization), and BERTScore (semantic similarity), along with classification metrics like F1 and Accuracy.
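
To make the overlap-based metrics concrete, here is a minimal sketch of a ROUGE-1 style F1 score. It is not the skill's own implementation: the function name `rouge1_f1` is hypothetical, and it assumes lowercased whitespace tokenization with no stemming, which is simpler than what standard ROUGE packages do.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate text.

    Assumption: tokens are lowercased whitespace-separated words.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A candidate that copies only part of the reference gets perfect precision but low recall, which is why the combined F1 is usually reported rather than either number alone.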

Can I use this for RAG (Retrieval-Augmented Generation) testing?

Yes, the skill provides specific strategies for RAG, including retrieval metrics like MRR and NDCG, as well as groundedness checks to ensure responses are based only on the provided context.
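
As an illustration of the retrieval metrics named above, the sketch below computes MRR and NDCG from per-query relevance judgments. The function names are hypothetical and not taken from the skill. One simplifying assumption: the ideal ranking for NDCG is built from the gains of the retrieved list only, not from every relevant document in the corpus.

```python
import math
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """MRR over queries.

    Each inner sequence lists retrieved documents in rank order,
    with 1 for relevant and 0 for not relevant.
    """
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for rank, flag in enumerate(flags, start=1):
            if flag:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k graded relevance values."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the best possible ordering of `gains`."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

MRR only rewards the position of the first relevant document, while NDCG accounts for graded relevance across the whole top-k list, so the two are often reported together.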

How does 'LLM-as-Judge' improve my evaluation process?

It allows you to use highly capable models like Claude to automatically grade the outputs of other models based on complex criteria like helpfulness, reasoning, and tone, which are difficult for traditional code-based metrics to capture.
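
A pointwise judge can be sketched as a prompt template plus a score parser. Everything here is illustrative rather than the skill's actual code: the template wording, the 1 to 5 scale, and the `call_judge` callable (a stand-in for whatever client sends the prompt to the judge model) are all assumptions.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; the wording and scale are assumptions.
JUDGE_TEMPLATE = (
    "You are grading a model response.\n"
    "Criterion: {criterion}\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Reply with a line of the form 'Score: N' where N is an integer from 1 to 5."
)


def parse_score(judge_output: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Extract the integer after 'Score:'; return None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", judge_output)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


def judge_pointwise(
    question: str,
    response: str,
    criterion: str,
    call_judge: Callable[[str], str],
) -> Optional[int]:
    """Build the grading prompt, send it via `call_judge`, and parse the score."""
    prompt = JUDGE_TEMPLATE.format(
        criterion=criterion, question=question, response=response
    )
    return parse_score(call_judge(prompt))
```

Passing the model call in as a function keeps the sketch independent of any particular API and lets the parsing logic be tested with a stub judge. Returning `None` for unparseable output makes malformed judge replies visible instead of silently scoring them.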

Does it support human-in-the-loop evaluation?

Yes. The skill includes structures for defining human annotation tasks and functions for calculating inter-rater agreement (Cohen's Kappa), so you can check that manual evaluations are consistent across annotators.
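
For reference, Cohen's Kappa for two annotators can be computed as below. This is a minimal sketch with a hypothetical function name, not the skill's own code. It assumes both annotators labeled the same items in the same order.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")

    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if each rater labeled independently at their own rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean the raters agree less often than chance would predict.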

LLM Evaluation Framework

Name: LLM Evaluation Framework
Author: Microck

Data Science & ML

Implements systematic evaluation strategies for LLM applications using automated metrics, human feedback, and LLM-as-judge patterns.

The LLM Evaluation skill provides a comprehensive toolkit for measuring and improving the quality of Large Language Model applications. It enables developers to move beyond anecdotal testing by implementing industry-standard automated metrics like BLEU and ROUGE, advanced 'LLM-as-Judge' scoring systems, and structured human evaluation workflows. Whether you are comparing prompt variations, detecting performance regressions, or validating RAG systems, this skill provides the rigorous benchmarking framework needed to build confidence in production-grade AI systems.

Key Features

01 LLM-as-Judge patterns for qualitative pointwise and pairwise scoring

02 2 GitHub stars

03 Human annotation frameworks with inter-rater agreement calculation

04 Automated NLP metrics including BLEU, ROUGE, and BERTScore

05 Statistical A/B testing for systematic model comparison

06 RAG-specific evaluation for retrieval accuracy and groundedness

Use Cases

01 Validating the accuracy and relevance of RAG-based application outputs

02 Establishing automated CI/CD regression testing for AI-generated content

03 Benchmarking different model versions or prompt engineering strategies

Install with 🐟 Skill.Fish

npx skillfish add microck/opendots-microck llm-evaluation

For use in Claude.ai and ChatGPT
