How does this skill help with RAG pipelines?

It provides built-in support for RAGAS metrics, so you can measure faithfulness, answer relevancy, and context precision and check that your RAG system's answers are grounded in the retrieved context.
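As a rough illustration of what two of these metrics measure, the sketch below computes faithfulness and context precision from verdicts that are already available. It does not call the RAGAS library or the skill itself; the function names and the hand-written verdict lists are assumptions for illustration, and in practice an LLM would usually produce the verdicts.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of the answer's claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_precision(chunk_relevant: list[bool]) -> float:
    """Mean of precision@k, taken at each rank k that holds a relevant chunk."""
    hits = 0
    total = 0.0
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision over the top-k retrieved chunks
    return total / hits if hits else 0.0


# Hypothetical verdicts: 3 of 4 claims supported; chunks 1 and 3 relevant.
print(faithfulness([True, True, False, True]))   # 0.75
print(context_precision([True, False, True]))    # (1/1 + 2/3) / 2
```

Ranking relevant chunks earlier raises context precision, which is why the same set of chunks can score differently depending on retrieval order.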

How does the skill handle hallucination detection?

It implements logic to compare the generated output against provided context to identify and flag claims that are not supported by the source material.
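The idea can be sketched with a deliberately simple stand-in: split the answer into sentences and flag any sentence whose words mostly do not appear in the context. This lexical-overlap heuristic is an assumption chosen to keep the example self-contained, not the skill's actual method; `flag_unsupported` and `min_overlap` are hypothetical names, and a real detector would more likely use an LLM or an entailment model to judge each claim.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def flag_unsupported(answer: str, context: str, min_overlap: float = 0.6) -> list[str]:
    """Return answer sentences whose word overlap with the context is below min_overlap."""
    context_tokens = _tokens(context)
    flagged = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        overlap = len(words & context_tokens) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


context = "The Eiffel Tower is in Paris. It was completed in 1889."
answer = "The Eiffel Tower is in Paris. It was designed by Leonardo da Vinci."
print(flag_unsupported(answer, context))
# ['It was designed by Leonardo da Vinci.']
```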

Can I prevent low-quality AI responses from reaching users?

Yes, by using the Quality Gates feature, you can set a threshold (e.g., 0.7) that must be met by the evaluation metrics before the output is allowed to proceed.
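A minimal sketch of such a gate is shown below. It assumes that every metric must reach the threshold and that a fixed fallback message is returned otherwise; `passes_quality_gate`, `deliver`, and the fallback text are hypothetical names for illustration, not the skill's API.

```python
def passes_quality_gate(scores: dict[str, float], threshold: float = 0.7) -> bool:
    """Pass only if at least one metric exists and every score reaches the threshold."""
    return bool(scores) and all(value >= threshold for value in scores.values())


def deliver(response: str, scores: dict[str, float],
            fallback: str = "Sorry, I could not produce a reliable answer.") -> str:
    """Return the response if it clears the gate, otherwise the fallback."""
    return response if passes_quality_gate(scores) else fallback


# Hypothetical scores: answer relevancy falls below the 0.7 threshold.
scores = {"faithfulness": 0.91, "answer_relevancy": 0.64}
print(deliver("Paris is the capital of France.", scores))
# Sorry, I could not produce a reliable answer.
```

Requiring every metric to pass is the strictest choice; gating on an average or on a single metric are alternatives, depending on which failures matter most.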

What is the LLM-as-judge pattern?

It is an evaluation technique where a separate, often more capable LLM is used to score the outputs of another model based on specific criteria like relevance, accuracy, or tone.
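The pattern can be sketched as a prompt template plus a parser for the judge's reply. To stay independent of any provider, the judge model is passed in as a plain callable; `judge_score`, `JUDGE_PROMPT`, the 1 to 5 scale, and `fake_judge` are all assumptions for illustration rather than the skill's implementation.

```python
import re
from typing import Callable

JUDGE_PROMPT = """You are grading an AI answer for {criterion}.
Question: {question}
Answer: {answer}
Reply with a single integer score from 1 (poor) to 5 (excellent)."""


def judge_score(question: str, answer: str, criterion: str,
                call_judge: Callable[[str], str]) -> float:
    """Ask the judge model for a 1-5 score and normalize it to the range 0.0-1.0."""
    prompt = JUDGE_PROMPT.format(criterion=criterion, question=question, answer=answer)
    reply = call_judge(prompt)
    match = re.search(r"\b[1-5]\b", reply)  # first standalone digit from 1 to 5
    if match is None:
        raise ValueError(f"Judge reply contained no score: {reply!r}")
    return (int(match.group()) - 1) / 4


def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call; always answers with a score of 4."""
    return "Score: 4"


print(judge_score("What is 2 + 2?", "4", "accuracy", fake_judge))  # 0.75
```

In real use, `call_judge` would wrap a call to whichever model acts as the judge, and the normalized score could feed a quality gate.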

LLM Evaluation & Quality Assessment

Author: yonatangross

Data Science & ML

Implements automated quality gates, LLM-as-judge patterns, and RAGAS metrics to ensure reliable and grounded AI outputs.

This skill provides production-ready patterns for evaluating Large Language Model outputs within the Claude Code environment. It lets developers automate quality assurance through LLM-as-judge scoring, hallucination detection, and RAGAS metrics such as faithfulness and answer relevancy. Configurable quality gates and pairwise comparisons check that AI-generated content meets set performance thresholds before it reaches production, which is useful when building RAG pipelines and self-correcting agent loops.

Key Features

01 Batch evaluation and pairwise comparison for model benchmarking

02 RAGAS metrics integration for validating retrieval-augmented generation

03 Configurable quality gates with customizable pass/fail thresholds

04 LLM-as-Judge patterns for automated, multi-dimensional quality scoring

05 69 GitHub stars

06 Automated hallucination detection and factual grounding verification

Use Cases

01 Conducting A/B testing and preference ranking between different model versions

02 Building self-correcting RAG pipelines that validate context and answer accuracy

03 Implementing automated CI/CD quality checks for AI-driven application features

Install with 🐟 Skill.Fish

npx skillfish add yonatangross/orchestkit llm-evaluation
