About
The llm-evaluation skill provides a comprehensive framework for measuring and improving the quality of AI applications. It bridges the gap between raw model output and production-ready performance through systematic evaluation strategies: automated n-gram and embedding metrics, LLM-as-judge patterns for semantic validation, and human-in-the-loop annotation structures. Whether you are benchmarking different models, detecting performance regressions, or validating RAG pipelines, this skill gives you the statistical and programmatic tools to establish reliable baselines and keep model outputs accurate, safe, and helpful over time.
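
As a rough illustration of the automated-metric side of this workflow, the sketch below scores candidate outputs against gold references with a simple n-gram precision check and aggregates a pass rate against a baseline threshold. The function names (`ngram_precision`, `evaluate`), the threshold, and the toy cases are illustrative placeholders, not part of the skill's actual API.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of candidate n-grams that also appear in the reference."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    if not cand:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    return overlap / sum(cand.values())


def evaluate(cases: list[dict], threshold: float = 0.5) -> dict:
    """Score each case and report the mean score and pass rate against a hypothetical baseline."""
    scores = [ngram_precision(c["output"], c["reference"]) for c in cases]
    passed = sum(score >= threshold for score in scores)
    return {"mean_score": sum(scores) / len(scores), "pass_rate": passed / len(cases)}


if __name__ == "__main__":
    # Toy cases standing in for real model outputs and gold references.
    cases = [
        {"output": "The capital of France is Paris.",
         "reference": "Paris is the capital of France."},
        {"output": "The Eiffel Tower is in Berlin.",
         "reference": "The Eiffel Tower is in Paris."},
    ]
    print(evaluate(cases))
```

In practice, lexical metrics like this are only a first filter; the same `evaluate` loop can be pointed at embedding similarity or an LLM-as-judge scorer when surface overlap is too blunt a measure.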