About
This skill provides a framework for validating LLM application outputs through a multi-layered evaluation approach. Developers can combine traditional reference-based metrics such as ROUGE and BERTScore with LLM-as-Judge patterns for nuanced, rubric-based scoring. It also includes specialized modules for hallucination detection and RAG grounding checks, supporting reliable benchmarks and continuous evaluation pipelines within CI/CD workflows.
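To illustrate the traditional-metric layer, here is a minimal, self-contained sketch of ROUGE-1 F1 (unigram overlap with whitespace tokenization and no stemming). The `rouge1_f1` name is hypothetical; in practice you would typically use a library such as `rouge-score` rather than this simplified version:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Each shared token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"), 4))
```

A benchmark harness would compute scores like this over a dataset of reference/output pairs and fail the CI run when the aggregate falls below a threshold.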