Is this skill suitable for RAG evaluation pipelines?

Yes. It is built with RAG in mind: it validates document-query relationships and checks that every query references content chunks that actually exist.
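As a rough illustration of that referential check, the sketch below assumes each document lists its chunks under a `chunks` key and each query names the chunks it expects under `expected_chunk_ids`. These field names and the function `find_broken_references` are hypothetical and are not taken from the skill itself.

```python
def find_broken_references(documents, queries):
    """Return (query_id, missing_chunk_id) pairs for references to unknown chunks."""
    # Collect every chunk id that actually exists in the dataset.
    known_chunks = {
        chunk["id"]
        for doc in documents
        for chunk in doc.get("chunks", [])
    }
    broken = []
    for query in queries:
        for chunk_id in query.get("expected_chunk_ids", []):
            if chunk_id not in known_chunks:
                broken.append((query["id"], chunk_id))
    return broken


# Hypothetical sample data: q-2 points at a chunk that does not exist.
documents = [{"id": "doc-1", "chunks": [{"id": "doc-1#0"}, {"id": "doc-1#1"}]}]
queries = [
    {"id": "q-1", "expected_chunk_ids": ["doc-1#0"]},
    {"id": "q-2", "expected_chunk_ids": ["doc-1#1", "doc-9#0"]},
]
print(find_broken_references(documents, queries))  # [('q-2', 'doc-9#0')]
```

An empty result means every query resolves to existing content; anything else is a dangling reference to fix before the dataset is used for evaluation.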

How does this skill detect duplicate content?

The skill combines URL normalization, which catches exact source matches, with embedding-based semantic similarity checks, which identify contextually identical content.
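A minimal sketch of how the two checks could fit together is shown below. The normalization rules (dropping `www.`, query strings, fragments, and trailing slashes), the 0.95 similarity threshold, and the `url` and `embedding` field names are all assumptions made for illustration; the skill's actual rules and threshold may differ, and the embeddings are assumed to be computed elsewhere.

```python
import math
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase the host and drop 'www.', query, fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), host, path, "", ""))


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_duplicate(doc_a, doc_b, threshold=0.95):
    """Flag a pair as duplicate on an exact normalized-URL match or high similarity."""
    if normalize_url(doc_a["url"]) == normalize_url(doc_b["url"]):
        return True
    return cosine_similarity(doc_a["embedding"], doc_b["embedding"]) >= threshold
```

The URL check is cheap and runs first; the embedding comparison only decides the outcome when the sources differ. Dropping the query string is a deliberate simplification and would wrongly merge pages that are distinguished only by query parameters.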

What is a 'golden dataset' in this context?

A golden dataset is a high-quality, manually verified set of ground truth data used to evaluate the performance and accuracy of AI models and RAG systems.

Why does the skill flag placeholder URLs?

Valid evaluation requires authentic source references; placeholders like 'example.com' or 'localhost' degrade the quality of citation-based AI testing and training.
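One way such a check might look is sketched below. Only 'example.com' and 'localhost' come from the description above; the other entries in `PLACEHOLDER_HOSTS`, and the choice to flag URLs with no host at all, are assumptions for illustration.

```python
from urllib.parse import urlsplit

# 'example.com' and 'localhost' are named in the text; the rest are assumed.
PLACEHOLDER_HOSTS = {"example.com", "example.org", "localhost", "127.0.0.1"}


def is_placeholder_url(url):
    """Return True if the URL points at a placeholder host or has no host."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        # No parseable host: treat as unusable for citation purposes.
        return True
    if host in PLACEHOLDER_HOSTS:
        return True
    # Also catch subdomains such as www.example.com.
    return any(host.endswith("." + p) for p in PLACEHOLDER_HOSTS)


for url in ["https://www.example.com/page", "http://localhost:8000/docs",
            "https://docs.python.org/3/"]:
    print(url, is_placeholder_url(url))
```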

Does it check for query difficulty balance?

Yes, it analyzes the distribution of query difficulty levels, from trivial to adversarial, to ensure your dataset provides a comprehensive test of model capabilities.
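A balance check of this kind could be approximated as follows. The text only names the two ends of the scale (trivial and adversarial); the intermediate levels, the `difficulty` field name, and the 10% minimum share are assumptions chosen for the sketch.

```python
from collections import Counter

# Only "trivial" and "adversarial" are named in the text; the middle levels are assumed.
DIFFICULTY_LEVELS = ["trivial", "easy", "medium", "hard", "adversarial"]


def difficulty_gaps(queries, min_share=0.1):
    """Return {level: share} for every level below min_share of all queries."""
    counts = Counter(q.get("difficulty") for q in queries)
    total = len(queries)
    gaps = {}
    for level in DIFFICULTY_LEVELS:
        share = counts[level] / total if total else 0.0
        if share < min_share:
            gaps[level] = share
    return gaps


# Hypothetical dataset skewed toward easy questions.
queries = (
    [{"difficulty": "trivial"}] * 5
    + [{"difficulty": "easy"}] * 4
    + [{"difficulty": "medium"}] * 1
)
print(difficulty_gaps(queries))
```

Here the hard and adversarial levels would be reported as under-represented, which is the signal to add tougher queries before relying on the benchmark.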

Golden Dataset Validation

Name: Golden Dataset Validation
Author: yonatangross

Data Science & Machine Learning

Validates the integrity and quality of AI evaluation datasets through schema enforcement, duplicate detection, and coverage analysis.

Golden Dataset Validation is a specialized capability designed for AI engineers and data scientists to maintain high-quality benchmarks for LLM evaluation. It automates rigorous checks for document and query schemas, prevents the inclusion of placeholder or duplicate content, and ensures a balanced distribution of query difficulties. By providing detailed gap analysis, referential integrity checks, and semantic similarity detection, it ensures your ground-truth data remains reliable, unique, and representative for high-stakes performance testing.

Key Features

01 Comprehensive referential integrity between queries and data sections

02 Semantic similarity and URL-based duplicate detection

03 Quality metric analysis for titles, content length, and tagging

04 Dataset gap detection for content types and difficulty distributions

05 8 GitHub stars

06 Automated JSON schema validation for documents and queries
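To give a feel for the schema validation feature, here is a deliberately simplified sketch that checks required fields and their types in plain Python. The field set in `DOCUMENT_SCHEMA` is hypothetical, and the skill itself may well rely on a full JSON Schema validator instead of a hand-rolled check like this.

```python
# Hypothetical document schema: field name -> expected Python type.
DOCUMENT_SCHEMA = {"id": str, "title": str, "url": str, "content": str, "tags": list}


def validate_record(record, schema):
    """Return a list of human-readable schema errors (empty if the record is valid)."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return errors


print(validate_record({"id": 1, "title": "Intro"}, DOCUMENT_SCHEMA))
```

The same function can be reused for queries by passing a second schema dictionary, so documents and queries are validated by one code path.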

Use Cases

01 Auditing existing data repositories for broken references and placeholder URLs

02 Validating new benchmark data before merging into a RAG evaluation set

03 Identifying under-represented topics or difficulty levels in an AI training dataset

Install with 🐟 Skill.Fish

npx skillfish add yonatangross/skillforge-claude-plugin golden-dataset-validation

For use in Claude.ai and ChatGPT