The Golden Dataset Curation skill automates building ground-truth datasets for AI model evaluation, with a focus on RAG and LLM benchmarking. It uses a multi-agent pipeline to classify content types, determine semantic difficulty, and evaluate documents across four dimensions: accuracy, coherence, depth, and relevance. By automating technical density scoring and synthetic test query generation, it helps AI engineers and data scientists assemble diverse, high-quality evaluation data with less manual curation effort.
Key Features
01 Automated content type and semantic difficulty classification
02 8 GitHub stars
03 Synthetic test query generation with varied complexity levels
04 Multi-agent quality validation pipeline with consensus aggregation
05 Domain-specific tagging for RAG and embedding evaluation
06 Technical accuracy and depth scoring for documentation
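
The listing does not document how consensus aggregation is implemented, so the following is only a minimal sketch of one way several agents' scores could be combined across the four dimensions named above. The function name `aggregate_consensus`, the 0 to 1 score range, the use of the median, and the `threshold` and `max_spread` parameters are all illustrative assumptions, not the skill's actual API.

```python
from statistics import median

# The four dimensions named in the skill description.
DIMENSIONS = ("accuracy", "coherence", "depth", "relevance")


def aggregate_consensus(agent_scores, threshold=0.7, max_spread=0.3):
    """Combine per-agent scores into one consensus verdict (hypothetical helper).

    agent_scores: list of dicts mapping each dimension to a score in [0, 1].
    threshold:    assumed minimum overall score for accepting a document.
    max_spread:   assumed largest max-min gap before a dimension is "disputed".
    """
    if not agent_scores:
        raise ValueError("at least one agent score is required")

    consensus = {}
    disputed = []
    for dim in DIMENSIONS:
        values = [scores[dim] for scores in agent_scores]
        # The median is less sensitive to a single outlier agent than the mean.
        consensus[dim] = median(values)
        if max(values) - min(values) > max_spread:
            disputed.append(dim)

    overall = sum(consensus.values()) / len(DIMENSIONS)
    return {
        "scores": consensus,
        "overall": overall,
        "disputed": disputed,
        # Accept only when the score is high enough and the agents agree.
        "accepted": overall >= threshold and not disputed,
    }


if __name__ == "__main__":
    example = [
        {"accuracy": 0.9, "coherence": 0.8, "depth": 0.7, "relevance": 0.9},
        {"accuracy": 0.8, "coherence": 0.8, "depth": 0.6, "relevance": 0.9},
        {"accuracy": 0.85, "coherence": 0.7, "depth": 0.2, "relevance": 0.8},
    ]
    print(aggregate_consensus(example))
```

In this sketch a document whose agents disagree strongly on any dimension is held back for review even if its overall score clears the threshold, which is one plausible way to keep low-agreement items out of a golden dataset.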