What RAG metrics are supported by this skill?

The skill supports key RAGAS metrics including Faithfulness (grounding), Answer Relevancy, Context Precision, and Context Recall.
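As an illustration of what the grounding metric measures, Faithfulness is commonly defined as the fraction of claims in an answer that the retrieved context supports. The sketch below computes that ratio from per-claim verdicts. The `ClaimVerdict` type, the function name, and the sample claims are hypothetical; this is not the skill's or the RAGAS library's code. In practice a judge model would extract the claims and produce the verdicts.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    """Hypothetical record: one claim from the answer and whether the context backs it."""
    claim: str
    supported: bool


def faithfulness_score(verdicts: list[ClaimVerdict]) -> float:
    """Return the fraction of claims supported by the context (0.0 to 1.0)."""
    if not verdicts:
        # Assumption: an answer with no extractable claims is treated as ungrounded.
        return 0.0
    return sum(v.supported for v in verdicts) / len(verdicts)


# Sample verdicts, hard-coded here; a judge model would normally produce them.
verdicts = [
    ClaimVerdict("The Eiffel Tower is in Paris.", True),
    ClaimVerdict("It was completed in 1889.", True),
    ClaimVerdict("It is 500 metres tall.", False),
    ClaimVerdict("It was built for the 1889 World's Fair.", True),
]
score = faithfulness_score(verdicts)
print(f"faithfulness = {score:.2f}")
```

Three of the four sample claims are marked as supported, so the score is 0.75.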

What is the LLM-as-judge pattern?

It is a technique where a secondary, often more capable LLM is used to evaluate and score the outputs of another model based on specific criteria like relevance, tone, or accuracy.
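A minimal sketch of the pattern, assuming the judge is asked to reply with a JSON object of scores between 0.0 and 1.0. The prompt wording, the `judge` function, and the `fake_judge_model` stub are illustrative assumptions rather than the skill's actual implementation; a real setup would replace the stub with a call to the judge model's API.

```python
import json
from typing import Callable

# Criteria taken from the description above; the 0.0-1.0 scale is an assumption.
CRITERIA = ("relevance", "tone", "accuracy")


def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble the evaluation prompt sent to the judge model."""
    criteria = ", ".join(CRITERIA)
    return (
        "You are an impartial evaluator.\n"
        f"Score the answer from 0.0 to 1.0 on: {criteria}.\n"
        "Reply with a JSON object mapping each criterion to its score.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )


def judge(question: str, answer: str,
          call_judge_model: Callable[[str], str]) -> dict[str, float]:
    """Ask the judge model for scores and clamp each one into [0.0, 1.0]."""
    raw = call_judge_model(build_judge_prompt(question, answer))
    parsed = json.loads(raw)
    return {c: min(1.0, max(0.0, float(parsed[c]))) for c in CRITERIA}


def fake_judge_model(prompt: str) -> str:
    """Hypothetical stand-in for a real judge model call."""
    return '{"relevance": 0.9, "tone": 0.8, "accuracy": 1.2}'


scores = judge("What is RAG?", "RAG combines retrieval with generation.",
               fake_judge_model)
print(scores)
```

Clamping guards against a judge that returns an out-of-range value, such as the 1.2 in the stub, which becomes 1.0.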

How do quality gates work in this context?

Quality gates are automated checks that compare evaluation scores against a predefined threshold (e.g., 0.7). If an output scores below the threshold, the gate can block the response or trigger a retry.
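The block-or-retry behavior can be sketched as a small loop. The function name, the `max_attempts` parameter, and the stubbed generator and evaluator are hypothetical; only the 0.7 threshold comes from the example above.

```python
from typing import Callable, Optional


def quality_gate(generate: Callable[[], str],
                 evaluate: Callable[[str], float],
                 threshold: float = 0.7,
                 max_attempts: int = 3) -> Optional[str]:
    """Return the first response scoring at or above the threshold.

    Returns None (the response is blocked) if every attempt scores too low.
    """
    for _ in range(max_attempts):
        response = generate()
        if evaluate(response) >= threshold:
            return response
    return None


# Stubbed generator and evaluator so the sketch runs without a model.
drafts = iter(["vague answer", "grounded answer"])
stub_scores = {"vague answer": 0.4, "grounded answer": 0.85}

result = quality_gate(lambda: next(drafts), stub_scores.__getitem__)
print(result)
```

Here the first draft scores 0.4 and triggers a retry, and the second scores 0.85 and passes the gate.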

Can I use the same model to judge its own output?

No, the skill treats this as a forbidden anti-pattern. A model judging its own output tends to rate it favorably, so always use a different judge model (e.g., GPT-4o-mini to judge Claude) to reduce that bias.

LLM Output Evaluation & Quality Assurance

Author: yonatangross

Data Science & Machine Learning

Implements advanced LLM-as-judge patterns and RAGAS metrics to evaluate AI output quality and detect hallucinations.

The LLM Evaluation skill provides a robust framework for assessing the quality and reliability of AI-generated content within the Claude Code environment. By leveraging production-ready patterns like LLM-as-judge and RAGAS (Retrieval-Augmented Generation Assessment), it enables developers to implement automated quality gates, detect hallucinations, and measure faithfulness across multiple dimensions. This skill is essential for building production-grade AI systems that require rigorous validation, factual grounding, and consistent output quality before delivery to end-users.

Key Features

01 LLM-as-judge implementation for multi-dimensional output scoring

02 Pairwise comparison for A/B testing different model outputs

03 Configurable quality gates to block low-confidence AI responses

04 69 GitHub stars

05 Automated hallucination detection and factual grounding checks

06 RAGAS metrics support for faithfulness, relevancy, and context precision
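Pairwise comparison can be sketched as follows. Because LLM judges can favor whichever output appears first, this version asks twice with the order swapped and declares a winner only when both verdicts agree. The function names and the stubbed judges are hypothetical, and the order-swap step is a common mitigation rather than something the listing confirms the skill does.

```python
from typing import Callable

# A judge takes (prompt, first_output, second_output) and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def pairwise_compare(prompt: str, output_a: str, output_b: str,
                     pick_better: Judge) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    a_wins_forward = pick_better(prompt, output_a, output_b) == "first"
    a_wins_reverse = pick_better(prompt, output_b, output_a) == "second"
    if a_wins_forward and a_wins_reverse:
        return "A"
    if not a_wins_forward and not a_wins_reverse:
        return "B"
    return "tie"  # the verdict flipped with the order, so it is not trusted


def length_judge(prompt: str, first: str, second: str) -> str:
    """Stub judge that prefers the longer output (stands in for a model call)."""
    return "first" if len(first) >= len(second) else "second"


def position_biased_judge(prompt: str, first: str, second: str) -> str:
    """Stub judge that always prefers whatever it sees first."""
    return "first"


winner = pairwise_compare("Explain RAG.", "Short.",
                          "A longer, more detailed answer.", length_judge)
biased = pairwise_compare("Explain RAG.", "Short.",
                          "A longer, more detailed answer.", position_biased_judge)
print(winner, biased)
```

With the stubs above, the consistent judge picks B, while the position-biased judge produces a tie instead of a spurious win.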

Use Cases

01 Detecting and filtering hallucinations in production-facing customer support bots

02 Implementing a quality gate in a RAG pipeline to ensure context relevance

03 Running batch evaluations on golden datasets to measure model performance changes
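The batch-evaluation use case can be sketched as averaging a per-item score over a golden dataset and comparing a candidate against a baseline. Everything in this example is an illustrative assumption: the dataset, the two stub "models", the exact-match scorer (a stand-in for a judge or RAGAS metric), and the 0.05 regression tolerance.

```python
from statistics import mean
from typing import Callable

# Hypothetical golden dataset: questions paired with expected answers.
golden = [
    {"question": "2+2", "expected": "4"},
    {"question": "capital of France", "expected": "Paris"},
]


def exact_match(answer: str, expected: str) -> float:
    """Simplest possible scorer; a real run would use a judge or RAGAS metric."""
    return 1.0 if answer == expected else 0.0


def evaluate_batch(dataset: list[dict],
                   answer_fn: Callable[[str], str],
                   score_fn: Callable[[str, str], float]) -> float:
    """Mean score of answer_fn over every item in the golden dataset."""
    return mean(score_fn(answer_fn(item["question"]), item["expected"])
                for item in dataset)


# Stub models implemented as lookup tables so the sketch runs offline.
old_model = {"2+2": "4", "capital of France": "Lyon"}.get
new_model = {"2+2": "4", "capital of France": "Paris"}.get

baseline = evaluate_batch(golden, old_model, exact_match)
candidate = evaluate_batch(golden, new_model, exact_match)
regressed = candidate < baseline - 0.05  # assumed tolerance for noise
print(baseline, candidate, regressed)
```

The stub baseline answers one of two questions correctly and the candidate answers both, so no regression is flagged.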

Install with 🐟 Skill.Fish

npx skillfish add yonatangross/orchestkit llm-evaluation