What is a golden dataset in AI development?

A golden dataset is a highly curated collection of 'ground truth' data points used as a benchmark to evaluate the accuracy, relevance, and performance of AI models and RAG systems.

Can I use this for RAG evaluation?

Yes. The skill generates the test queries and document classifications needed to build RAG evaluation frameworks and regression tests.
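As a rough illustration of how golden test queries can drive a RAG regression test, the sketch below computes recall@k against expected document IDs. The `GoldenQuery` record, the `run_regression` helper, the thresholds, and the stand-in retriever are all hypothetical names chosen for this example; the listing does not describe the skill's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenQuery:
    """Hypothetical golden entry: a test query and the documents it should retrieve."""
    query: str
    expected_doc_ids: frozenset


def recall_at_k(retrieved_ids, expected_ids, k):
    """Fraction of expected documents that appear in the top-k retrieved results."""
    if not expected_ids:
        return 1.0  # nothing to find, so nothing was missed
    top_k = set(retrieved_ids[:k])
    return len(top_k & expected_ids) / len(expected_ids)


def run_regression(golden, retrieve, k=5, min_recall=0.8):
    """Return (query, recall) pairs for queries whose recall@k is below min_recall."""
    failures = []
    for entry in golden:
        score = recall_at_k(retrieve(entry.query), entry.expected_doc_ids, k)
        if score < min_recall:
            failures.append((entry.query, score))
    return failures


# Stand-in retriever for illustration; a real system would query its index here.
def fake_retrieve(query):
    index = {
        "how do I rotate API keys?": ["doc-7", "doc-2", "doc-9"],
        "what is the rate limit?": ["doc-4", "doc-1"],
    }
    return index.get(query, [])


golden = [
    GoldenQuery("how do I rotate API keys?", frozenset({"doc-7", "doc-2"})),
    GoldenQuery("what is the rate limit?", frozenset({"doc-3"})),
]
failures = run_regression(golden, fake_retrieve, k=3)
print(failures)
```

Here the second query fails because its expected document never appears in the retrieved list. Running the same check after each retriever change would turn the golden queries into a regression suite.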

Does this skill integrate with observability tools?

Yes, it features built-in Langfuse integration to trace curation decisions, log individual quality scores, and maintain an audit trail for every document evaluated.

How does the multi-agent pipeline work?

The skill runs parallel agents that each analyze one aspect of a document, such as technical quality, difficulty level, or domain relevance. A consensus aggregator then combines their results to decide whether the document meets the thresholds for inclusion.
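The general pattern might look like the sketch below: scoring functions run in parallel, and a weighted mean is compared against an inclusion threshold. This is an assumption-heavy illustration, not the skill's implementation. The weights, the 0.75 threshold, and the constant-score stand-in agents are hypothetical; the dimension names are the four quality dimensions mentioned elsewhere in this listing.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical weights and threshold; the skill's actual values are not stated in this listing.
WEIGHTS = {"accuracy": 0.4, "coherence": 0.2, "depth": 0.2, "relevance": 0.2}
INCLUDE_THRESHOLD = 0.75


def run_agents(document, agents):
    """Run every agent on the document in parallel and return {dimension: score}."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {name: pool.submit(fn, document) for name, fn in agents.items()}
        return {name: future.result() for name, future in futures.items()}


def aggregate(scores, weights=WEIGHTS, threshold=INCLUDE_THRESHOLD):
    """Combine per-dimension scores (each in [0, 1]) into a weighted mean and a decision."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted = sum(weights[dim] * scores[dim] for dim in weights) / total_weight
    return {"score": round(weighted, 3), "include": weighted >= threshold}


# Stand-in agents returning fixed scores; real agents would call a model.
agents = {
    "accuracy": lambda doc: 0.9,
    "coherence": lambda doc: 0.8,
    "depth": lambda doc: 0.7,
    "relevance": lambda doc: 0.8,
}
scores = run_agents("Example document text.", agents)
decision = aggregate(scores)
print(decision)
```

Weighting accuracy more heavily is only one possible design choice; equal weights or a per-dimension minimum would be equally plausible readings of "consensus".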

Golden Dataset Curation

Name: Golden Dataset Curation
Author: yonatangross


Data Science & Machine Learning

Curates high-quality evaluation datasets for AI models using multi-agent validation and automated quality scoring.

This skill automates building and maintaining golden datasets, the ground-truth benchmarks used to evaluate AI models and RAG systems. Its multi-agent pipeline fetches content, classifies difficulty on a scale from trivial to adversarial, generates test queries, and scores each document on four quality dimensions: accuracy, coherence, depth, and relevance. Curation decisions are traced in Langfuse and made by consensus, so that only reliable and diverse data enters your evaluation suite. The pipeline also helps prevent duplicates and maintain balanced domain coverage.

Key Features

01 Synthetic test query generation for comprehensive RAG system evaluation.

02 Native Langfuse integration for full traceability and audit trails of curation decisions.

03 69 GitHub stars

04 Automated classification of content types and semantic difficulty levels.

05 Multi-agent validation pipeline with weighted consensus aggregation for data inclusion.

06 Quality scoring across four dimensions: accuracy, coherence, depth, and relevance.

Use Cases

01 Auditing existing evaluation data to identify low-quality or redundant document entries.

02 Building high-fidelity ground truth datasets for RAG (Retrieval-Augmented Generation) benchmarks.

03 Scaling technical documentation reviews through parallel AI-driven quality assessment.
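The auditing use case could be approximated with a pass like the one below, which flags entries under a quality cutoff and entries whose normalized text repeats an earlier one. This is a minimal sketch under stated assumptions: the entry fields (`id`, `text`, `score`), the 0.7 cutoff, and the hash-based duplicate check are hypothetical, and the skill may well use a different notion of redundancy, such as semantic similarity.

```python
import hashlib
import re


def fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace, to catch near-verbatim copies."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def audit(entries, min_score=0.7):
    """Return (low_quality_ids, duplicates), where duplicates pairs each repeat with its original."""
    seen = {}
    low_quality, duplicates = [], []
    for entry in entries:
        if entry["score"] < min_score:
            low_quality.append(entry["id"])
        key = fingerprint(entry["text"])
        if key in seen:
            duplicates.append((entry["id"], seen[key]))
        else:
            seen[key] = entry["id"]
    return low_quality, duplicates


# Hypothetical dataset entries for illustration only.
entries = [
    {"id": "a", "text": "Rotate keys every 90 days.", "score": 0.9},
    {"id": "b", "text": "rotate  keys every 90 days. ", "score": 0.8},
    {"id": "c", "text": "Keys are strings.", "score": 0.4},
]
low_quality, duplicates = audit(entries)
print(low_quality, duplicates)
```

Exact-match hashing only catches copies that differ in case or whitespace; paraphrased duplicates would need an embedding or fuzzy-matching step instead.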


Install with 🐟 Skill.Fish

npx skillfish add yonatangross/orchestkit golden-dataset-curation
