Can it help if I don't have a test dataset yet?

Yes. The skill can scan your indexed documents and automatically generate representative questions with expected answers, giving you a starting test set.

Do I need an external API key to use this skill?

Local evaluation is fully functional without an API key. An Ailog API key is required only if you want to benchmark your system against Ailog's production RAG API.

How does it evaluate the accuracy of AI answers?

It uses an LLM-as-judge approach: a judge model scores each response against the retrieved context to check that the answer is grounded in that context and free of hallucinations.
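As a rough illustration of the LLM-as-judge idea, the sketch below builds a grading prompt from the question, the answer, and the retrieved context, then parses a JSON verdict. The prompt wording, the 0 to 1 score scale, and the names `build_judge_prompt`, `judge_faithfulness`, and `call_llm` are assumptions made for this example, not the skill's actual prompt or API. `call_llm` stands for any function that takes a prompt string and returns the model's text reply.

```python
import json
from typing import Callable, Dict, List

# Hypothetical prompt template; the skill's real judge prompt is not shown in the source.
JUDGE_TEMPLATE = """You are grading a RAG answer for faithfulness.
Context:
{context}

Question: {question}
Answer: {answer}

Reply with JSON: {{"score": <number between 0 and 1>, "unsupported_claims": [<strings>]}}"""


def build_judge_prompt(question: str, answer: str, contexts: List[str]) -> str:
    """Join the retrieved chunks and fill in the grading template."""
    context = "\n---\n".join(contexts)
    return JUDGE_TEMPLATE.format(context=context, question=question, answer=answer)


def judge_faithfulness(
    question: str,
    answer: str,
    contexts: List[str],
    call_llm: Callable[[str], str],
) -> Dict[str, object]:
    """Ask the judge model for a verdict and clamp the score to [0, 1]."""
    raw = call_llm(build_judge_prompt(question, answer, contexts))
    verdict = json.loads(raw)
    score = min(1.0, max(0.0, float(verdict["score"])))
    return {
        "score": score,
        "unsupported_claims": list(verdict.get("unsupported_claims", [])),
    }


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return json.dumps({"score": 1.2, "unsupported_claims": []})


result = judge_faithfulness(
    "What is the refund window?",
    "Refunds are accepted within 30 days.",
    ["Our policy allows refunds within 30 days of purchase."],
    fake_llm,
)
```

Passing the model call in as a function keeps the grading logic testable without network access; in practice `fake_llm` would be replaced by a call to whichever model acts as the judge.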

What metrics does rag-eval track?

It tracks retrieval metrics (Recall, Precision, MRR, NDCG), generation metrics (Faithfulness, Relevance, Coherence, Conciseness), and latency percentiles (P50 and P95).
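For readers unfamiliar with these metrics, here is a minimal sketch of how the retrieval scores and latency percentiles are commonly computed for a single query. It assumes binary relevance, unique document IDs in the ranked list, and a nearest-rank percentile; the function names are hypothetical and the skill's own implementation may differ in these details.

```python
import math
from typing import List, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant.

    Divides by the number of results actually returned (at most k).
    """
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result; MRR is the mean of this over queries."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=50 for P50 and pct=95 for P95."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Example: one query's ranked results, its relevant set, and some latencies in ms.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
latencies_ms = [120, 80, 200, 95, 150, 110, 300, 90, 130, 100]

recall = recall_at_k(retrieved, relevant, k=3)
precision = precision_at_k(retrieved, relevant, k=3)
rr = reciprocal_rank(retrieved, relevant)
ndcg = ndcg_at_k(retrieved, relevant, k=4)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

In this example only one of the two relevant documents lands in the top three, so Recall@3 is 0.5, and the first relevant hit sits at rank 2, giving a reciprocal rank of 0.5.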

RAG Evaluation & Benchmarking

Author: floflo777

Data Science & Machine Learning

Evaluates RAG system performance through automated metrics, LLM-as-judge scoring, and competitive benchmarking.

The RAG Evaluation Skill provides a comprehensive framework for auditing and optimizing Retrieval-Augmented Generation pipelines directly within Claude Code. It enables developers to measure retrieval accuracy using standard metrics like Recall@K and MRR, while assessing generation quality through LLM-as-judge scoring for faithfulness and relevance. Whether you are testing local configurations or benchmarking against production-grade APIs like Ailog, this skill helps identify retrieval bottlenecks, hallucination risks, and latency issues to ensure high-quality AI responses.

Key Features

01. LLM-as-judge scoring for generation faithfulness, relevance, and coherence

02. Detailed performance reports with latency analysis and failure diagnostics

03. 17 GitHub stars

04. Production-grade benchmarking against the Ailog RAG API

05. Automated retrieval metrics calculation including Recall, Precision, MRR, and NDCG

06. Synthetic test dataset generation from existing indexed documents

Use Cases

01. Detecting hallucinations and verifying response grounding in enterprise knowledge bases

02. Optimizing retrieval strategies by comparing different chunking and indexing methods

03. Generating synthetic Q&A pairs to build robust test suites for new RAG pipelines

Install with 🐟 Skill.Fish

npx skillfish add floflo777/claude-rag-skills rag-eval

For use in Claude.ai and ChatGPT