What is a golden dataset?

A golden dataset is a curated set of high-quality data used as a ground truth to evaluate the performance of AI models, particularly in retrieval and generation tasks.

How does the multi-agent analysis pipeline work?

The skill runs four specialized agents (Quality Evaluator, Difficulty Classifier, Domain Tagger, and Query Generator) that analyze content in parallel. A consensus aggregator then combines their results into a final recommendation.
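
The parallel-then-aggregate flow can be sketched in Python. This is a minimal illustration, not the skill's actual implementation: the agent functions, their toy heuristics, the "include"/"exclude"/"review" labels, and the majority-vote rule are all assumptions made for the example.

```python
import asyncio
from collections import Counter

# Hypothetical stand-ins for the four agents. Each one returns a vote;
# a real agent would call a model instead of using these toy heuristics.
async def quality_evaluator(text: str) -> str:
    return "include" if len(text.split()) >= 5 else "exclude"

async def difficulty_classifier(text: str) -> str:
    return "include"

async def domain_tagger(text: str) -> str:
    return "include" if "api" in text.lower() else "exclude"

async def query_generator(text: str) -> str:
    return "include" if "?" not in text else "exclude"

AGENTS = [quality_evaluator, difficulty_classifier, domain_tagger, query_generator]

def aggregate(votes: list[str]) -> str:
    """Assumed consensus rule: simple majority, with ties sent to human review."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "review"
    return ranked[0][0]

async def analyze(text: str) -> str:
    # Run all agents concurrently, then reduce their votes to one recommendation.
    votes = await asyncio.gather(*(agent(text) for agent in AGENTS))
    return aggregate(list(votes))

def recommend(text: str) -> str:
    return asyncio.run(analyze(text))
```

Calling `recommend(document_text)` returns a single label for a document. Sending ties to "review" is one possible design choice; a real aggregator might weight agents differently.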

Does this skill help with RAG evaluation?

Yes. It generates test queries of varying difficulty, from trivial to adversarial, to test the robustness of retrieval-augmented generation (RAG) systems.
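
One way such difficulty-tagged queries might be used is to measure retrieval hit rate per difficulty level. The sketch below is illustrative only: the `GoldenQuery` type, the `retrieve` callable, and the intermediate level names are hypothetical, since the text names only "trivial" and "adversarial".

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

# Only "trivial" and "adversarial" appear in the text; the middle levels are assumed.
DIFFICULTY_LEVELS = ["trivial", "easy", "medium", "hard", "adversarial"]

@dataclass
class GoldenQuery:
    text: str
    difficulty: str       # one of DIFFICULTY_LEVELS
    expected_doc_id: str  # the document a correct retriever should return

def hit_rate_by_difficulty(
    queries: list[GoldenQuery],
    retrieve: Callable[[str], list[str]],  # hypothetical: query text -> ranked doc ids
    k: int = 3,
) -> dict[str, float]:
    """Fraction of queries per difficulty whose expected document is in the top k."""
    hits: dict[str, list[bool]] = defaultdict(list)
    for q in queries:
        top_k = retrieve(q.text)[:k]
        hits[q.difficulty].append(q.expected_doc_id in top_k)
    return {level: sum(flags) / len(flags) for level, flags in hits.items()}
```

A hit rate that falls sharply from trivial to adversarial queries would suggest the retriever is brittle on harder phrasing.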

What content types can this skill classify?

It is optimized for technical articles, step-by-step tutorials, research papers, API documentation, video transcripts, and code repositories.

AI Golden Dataset Curation

Name: AI Golden Dataset Curation
Author: yonatangross


Data Science & ML

Builds and refines high-quality AI evaluation datasets using multi-agent analysis and standardized quality metrics.

The Golden Dataset Curation skill automates the process of building ground-truth datasets for AI model evaluation, with a focus on RAG and LLM benchmarking. It uses a multi-agent pipeline to classify content types, determine semantic difficulty, and evaluate documents across four dimensions: accuracy, coherence, depth, and relevance. By automating technical density scoring and synthetic test query generation, the skill aims to keep evaluation data robust, diverse, and high-quality while reducing the manual curation effort for AI engineers and data scientists.
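
As a rough illustration of scoring across the four dimensions, the sketch below combines per-dimension scores into one number. The 0 to 1 scale, the equal default weights, and the `QualityScores` and `overall_score` names are assumptions for this example; the skill's actual scoring formula is not described here.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class QualityScores:
    # Assumed scale: each dimension is scored from 0.0 (worst) to 1.0 (best).
    accuracy: float
    coherence: float
    depth: float
    relevance: float

def overall_score(scores: QualityScores,
                  weights: Optional[dict[str, float]] = None) -> float:
    """Weighted average of the four dimensions (equal weights unless given)."""
    values = asdict(scores)
    if weights is None:
        weights = {name: 1.0 for name in values}
    total_weight = sum(weights[name] for name in values)
    return sum(values[name] * weights[name] for name in values) / total_weight
```

For example, `overall_score(QualityScores(0.8, 0.6, 0.4, 1.0))` averages the four values with equal weight. Passing a `weights` dict lets a curator emphasize, say, accuracy over depth.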

Key Features

01 Automated content type and semantic difficulty classification

02 8 GitHub stars

03 Synthetic test query generation with varied complexity levels

04 Multi-agent quality validation pipeline with consensus aggregation

05 Domain-specific tagging for RAG and embedding evaluation

06 Technical accuracy and depth scoring for documentation

Use Cases

01 Building high-fidelity benchmark datasets for RAG applications

02 Improving the quality of existing training data via multi-agent review

03 Generating diverse test queries for semantic search and retrieval evaluation


Install with 🐟 Skill.Fish

npx skillfish add yonatangross/skillforge-claude-plugin golden-dataset-curation

For use in Claude.ai and ChatGPT
