Implement comprehensive evaluation frameworks for LLM applications using automated metrics, human feedback, and benchmarking.
This skill provides a structured methodology for measuring the performance, quality, and reliability of Large Language Model applications. It offers implementation patterns for traditional NLP metrics like BLEU and ROUGE, modern embedding-based assessments like BERTScore, and advanced 'LLM-as-Judge' techniques. Whether you are validating RAG pipelines, comparing model versions, or establishing regression testing in a CI/CD environment, this skill equips you with the statistical tools and code patterns needed to build production-grade AI systems with confidence.
Key Features
1. Statistical A/B testing framework with Cohen's d and p-value analysis (sketched after this list)
2. RAG-specific evaluation patterns for retrieval and groundedness (see the retrieval sketch below)
3. LLM-as-Judge scoring for automated qualitative assessment (see the judge sketch below)
4. Regression detection to identify performance drops between model versions (see the quality-gate sketch under Use Cases)
5. Implementation of automated metrics including BLEU, ROUGE, and BERTScore (see the metrics sketch below)
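The A/B testing feature pairs a significance test with an effect size. Below is a minimal sketch of that idea using SciPy's Welch's t-test and a hand-computed Cohen's d; it assumes you have already collected per-example quality scores for a baseline and a candidate variant, and is an illustration rather than the skill's actual implementation.

```python
# Statistical comparison of two evaluation-score samples (e.g. baseline vs. candidate model).
# Assumes per-example quality scores have already been collected for both variants.
import numpy as np
from scipy import stats

def ab_test(baseline: list[float], candidate: list[float], alpha: float = 0.05) -> dict:
    """Welch's t-test plus Cohen's d effect size for two score samples."""
    a, b = np.asarray(baseline, dtype=float), np.asarray(candidate, dtype=float)

    # Welch's t-test does not assume equal variances between the two samples.
    t_stat, p_value = stats.ttest_ind(b, a, equal_var=False)

    # Cohen's d: mean difference scaled by the pooled standard deviation.
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    pooled_std = np.sqrt(pooled_var)
    cohens_d = (b.mean() - a.mean()) / pooled_std if pooled_std > 0 else 0.0

    return {
        "baseline_mean": float(a.mean()),
        "candidate_mean": float(b.mean()),
        "p_value": float(p_value),
        "cohens_d": float(cohens_d),
        "significant": bool(p_value < alpha),
    }

if __name__ == "__main__":
    baseline = [0.71, 0.68, 0.74, 0.66, 0.70, 0.69, 0.72, 0.67]
    candidate = [0.78, 0.74, 0.80, 0.73, 0.77, 0.75, 0.79, 0.76]
    print(ab_test(baseline, candidate))
```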
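For the retrieval half of RAG evaluation, a common pattern is hit rate and mean reciprocal rank over a labelled query set. The sketch below assumes each example records a gold document id and the retriever's ranked result ids; the field names are illustrative, not part of the skill's API. Groundedness of the generated answer is typically delegated to an LLM judge, as in the next sketch.

```python
# RAG retrieval evaluation sketch: hit rate and mean reciprocal rank of the gold document.
# Each example is assumed to look like {"gold_doc_id": str, "retrieved_ids": [str, ...]}
# with retrieved_ids already in rank order.
def retrieval_metrics(examples: list[dict], k: int = 5) -> dict:
    hits, reciprocal_ranks = 0, 0.0
    for ex in examples:
        ranked = ex["retrieved_ids"][:k]
        if ex["gold_doc_id"] in ranked:
            hits += 1
            # Rank is 1-based: a gold document at position 0 contributes 1.0.
            reciprocal_ranks += 1.0 / (ranked.index(ex["gold_doc_id"]) + 1)
    n = len(examples)
    return {"hit_rate@k": hits / n, "mrr@k": reciprocal_ranks / n}

if __name__ == "__main__":
    sample = [
        {"gold_doc_id": "doc-3", "retrieved_ids": ["doc-3", "doc-9", "doc-1"]},
        {"gold_doc_id": "doc-7", "retrieved_ids": ["doc-2", "doc-7", "doc-4"]},
        {"gold_doc_id": "doc-5", "retrieved_ids": ["doc-8", "doc-1", "doc-2"]},
    ]
    print(retrieval_metrics(sample, k=3))
```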
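LLM-as-Judge scoring sends the question and the model's answer to a grading model together with a rubric, then parses a structured verdict. In the sketch below, `call_llm` is a hypothetical placeholder for whatever chat-completion client you use; the rubric wording and JSON output format are illustrative choices rather than the skill's canonical prompt.

```python
# LLM-as-Judge: ask a strong model to grade an answer on a rubric and return a numeric score.
# `call_llm` is a hypothetical placeholder for your chat-completion client.
import json
import re

JUDGE_PROMPT = """You are an impartial evaluator. Rate the ANSWER to the QUESTION
on a 1-5 scale for factual accuracy and completeness.
Respond with JSON only: {{"score": <1-5>, "reason": "<one sentence>"}}

QUESTION: {question}
ANSWER: {answer}"""

def judge_answer(question: str, answer: str, call_llm) -> dict:
    """Return the judge's score and rationale, falling back to a regex if JSON parsing fails."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r'"score"\s*:\s*([1-5])', raw)
        verdict = {"score": int(match.group(1)) if match else None, "reason": raw.strip()}
    return verdict
```

In practice, judge scores are usually averaged over several calls, and pairwise comparisons shuffle answer order, to reduce sampling and positional bias.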
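The automated metrics can be assembled from standard open-source packages. The sketch below assumes the `nltk`, `rouge-score`, and `bert-score` packages are installed and shows one way to score a single prediction against a single reference; treat it as an illustration, not the skill's canonical implementation.

```python
# Automated reference-based metrics: BLEU (nltk), ROUGE-L (rouge-score), BERTScore (bert-score).
# Assumes: pip install nltk rouge-score bert-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate_generation(prediction: str, reference: str) -> dict:
    """Score a single model output against a single reference text."""
    # BLEU operates on token lists; smoothing avoids zero scores on short outputs.
    bleu = sentence_bleu(
        [reference.split()], prediction.split(),
        smoothing_function=SmoothingFunction().method1,
    )

    # ROUGE-L measures longest-common-subsequence overlap.
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, prediction)["rougeL"].fmeasure

    # BERTScore compares contextual embeddings rather than surface n-grams.
    _, _, f1 = bert_score([prediction], [reference], lang="en", verbose=False)

    return {"bleu": bleu, "rouge_l": rouge_l, "bert_score_f1": float(f1[0])}

if __name__ == "__main__":
    print(evaluate_generation(
        "The cat sat on the mat.",
        "A cat was sitting on the mat.",
    ))
```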
Use Cases
1. Validating the accuracy and factuality of RAG-based search systems
2. Benchmarking a new prompt or model version against a production baseline
3. Establishing automated quality gates for AI features in deployment pipelines (a minimal gate sketch follows this list)
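As an illustration of the deployment-pipeline use case, the sketch below compares a candidate run's mean score against a stored baseline and exits non-zero when the drop exceeds a tolerance, which is enough for most CI systems to fail the job. The file paths, score file format, and threshold are assumptions for the example.

```python
# Quality gate sketch: fail the pipeline when the candidate's mean score regresses
# beyond a tolerance relative to the stored baseline.
import json
import sys
from statistics import mean

BASELINE_PATH = "eval/baseline_scores.json"    # hypothetical path
CANDIDATE_PATH = "eval/candidate_scores.json"  # hypothetical path
MAX_REGRESSION = 0.02  # tolerate at most a 0.02 drop in mean score

def load_scores(path: str) -> list[float]:
    # Score files are assumed to contain a JSON list of per-example floats.
    with open(path) as fh:
        return json.load(fh)

def main() -> int:
    baseline = load_scores(BASELINE_PATH)
    candidate = load_scores(CANDIDATE_PATH)
    delta = mean(candidate) - mean(baseline)
    print(f"baseline={mean(baseline):.3f} candidate={mean(candidate):.3f} delta={delta:+.3f}")
    if delta < -MAX_REGRESSION:
        print("Quality gate FAILED: candidate regressed beyond tolerance.")
        return 1
    print("Quality gate passed.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```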