What metrics does this skill support for text generation?

It supports standard reference-based metrics such as BLEU, ROUGE, and METEOR, as well as embedding-based semantic similarity using BERTScore.

Can I use this for RAG applications?

Yes, the skill includes specialized metrics for retrieval-augmented generation, such as Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) for retrieval quality, plus automated groundedness checks.
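As a rough illustration of the two retrieval metrics named above, here is a minimal sketch in plain Python. It is not the skill's own implementation: the function names, the binary relevance flags for MRR, and the choice to compute the ideal ranking from the retrieved list only are all assumptions made for this example.

```python
import math
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """Average of 1/rank of the first relevant result per query (0 if none is relevant)."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance) if ranked_relevance else 0.0


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """NDCG@k with a log2 discount; `gains` are graded relevance scores in ranked order.

    Assumption: the ideal ordering is built from the retrieved items only.
    """
    def dcg(values: Sequence[float]) -> float:
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Three queries: first relevant hit at rank 2, at rank 1, and never.
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]]))
print(ndcg_at_k([3, 2, 1], k=3))
```

Both functions take relevance judgments that must be supplied separately, for example from a labeled evaluation set.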

How does it help with production deployments?

It includes a regression detector that compares new results against a set baseline to ensure that prompt or model updates don't degrade existing performance.
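The baseline comparison described above could look roughly like the following sketch. The function name `detect_regressions`, the `tolerance` parameter, and the assumption that higher scores are better for every metric are hypothetical and chosen for illustration; the skill's actual detector may work differently.

```python
from typing import Dict, List


def detect_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Return the metrics whose candidate score dropped below baseline by more than `tolerance`.

    Assumption: higher is better for every metric.
    """
    regressions = []
    for name, base_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None:
            # A metric missing from the new run is treated as a regression.
            regressions.append(name)
        elif base_score - new_score > tolerance:
            regressions.append(name)
    return regressions


baseline = {"rouge_l": 0.50, "accuracy": 0.90}
candidate = {"rouge_l": 0.45, "accuracy": 0.905}
print(detect_regressions(baseline, candidate))
```

In a CI pipeline, a non-empty result would typically fail the build so that the prompt or model change is reviewed before release.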

What is the 'LLM-as-Judge' feature?

It provides patterns to use a highly capable model to automatically grade the outputs of other models based on specific criteria like accuracy, helpfulness, and safety.
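A pointwise judge of the kind described above might be sketched as follows. Everything here is illustrative: `call_model` is a hypothetical callable standing in for whatever client sends a prompt to the judge model, and the prompt template and 1 to 5 scale are assumptions rather than the skill's actual pattern.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template; the wording and the 1-5 scale are assumptions.
JUDGE_TEMPLATE = (
    "Rate the response on {criterion} from 1 (poor) to 5 (excellent).\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Reply with 'Score: <number>' only."
)


def judge_pointwise(
    call_model: Callable[[str], str],
    question: str,
    response: str,
    criterion: str = "accuracy",
) -> Optional[int]:
    """Ask the judge model for a 1-5 score; return None if the reply cannot be parsed."""
    prompt = JUDGE_TEMPLATE.format(
        criterion=criterion, question=question, response=response
    )
    reply = call_model(prompt)
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


# Stub judge for demonstration; a real judge would call a model API here.
fake_judge = lambda prompt: "Score: 4"
print(judge_pointwise(fake_judge, "What is 2 + 2?", "4", criterion="accuracy"))
```

Passing the model call in as a function keeps the parsing logic testable without network access, and returning None on unparseable replies lets the caller decide whether to retry or discard the sample.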

LLM Evaluation Suite

Name: LLM Evaluation Suite
Author: rmyndharis

by rmyndharis • 424 GitHub stars • Data Science & ML

Implements comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking frameworks.

The LLM Evaluation skill provides a robust framework for measuring and improving the quality of AI applications. It offers standardized implementations for text generation metrics like BLEU and ROUGE, semantic similarity through BERTScore, and specialized retrieval metrics for RAG systems. Beyond automated scoring, it includes patterns for LLM-as-judge evaluations, human annotation workflows, and statistical A/B testing to detect performance regressions and validate prompt improvements before production deployment. This skill is essential for developers looking to move beyond vibes-based testing to data-driven AI development.

Key Features

01 LLM-as-Judge patterns for automated pointwise and pairwise output comparisons

02 Specialized RAG evaluation metrics for retrieval performance and groundedness

03 Automated text generation and classification metrics including BLEU, ROUGE, and BERTScore

04 Statistical A/B testing framework with Cohen’s d effect size and p-value analysis

05 424 GitHub stars

06 Automated regression detection to track performance across model and prompt versions
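To make the A/B testing feature more concrete, the sketch below computes Cohen's d with a pooled standard deviation and a two-sided p-value for the difference in means. The p-value uses a permutation test because it needs only the standard library; the listing does not say which significance test the skill uses, so this is a stand-in. Function names and parameters are hypothetical.

```python
import random
import statistics
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardized mean difference (b minus a) using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (statistics.mean(b) - statistics.mean(a)) / pooled_var ** 0.5


def permutation_p_value(
    a: Sequence[float], b: Sequence[float], n_resamples: int = 10_000, seed: int = 0
) -> float:
    """Two-sided p-value: how often a random relabeling gives a mean gap at least as large."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(b) - statistics.mean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        gap = abs(statistics.mean(pooled[len(a):]) - statistics.mean(pooled[:len(a)]))
        if gap >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-example scores for two prompt variants.
variant_a = [0.61, 0.58, 0.64, 0.60, 0.59]
variant_b = [0.70, 0.68, 0.73, 0.69, 0.71]
print(cohens_d(variant_a, variant_b))
print(permutation_p_value(variant_a, variant_b, n_resamples=2000))
```

Reporting the effect size alongside the p-value helps separate a difference that is statistically detectable from one that is large enough to matter.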

Use Cases

01 Comparing the performance of different model versions or providers for specific domain tasks

02 Measuring the impact of prompt engineering changes on model accuracy and coherence

03 Establishing automated CI/CD evaluation pipelines for RAG-based production systems


Install with 🐟 Skill.Fish

npx skillfish add rmyndharis/antigravity-skills llm-evaluation
