Facilitates human-in-the-loop judgment of AI trace outputs and promotes high-quality results back into evaluation datasets.
This skill streamlines the post-evaluation workflow for AI models by providing a structured environment for judging flagged trace outputs. It bridges the gap between raw execution logs and high-quality datasets through an interactive Q&A protocol that supports binary, categorical, and continuous scoring. By allowing developers to perform deep error analysis, triage pending review queues, and record judgment rationale, this skill ensures that only validated outputs are promoted back into datasets for future testing, fine-tuning, or benchmarking.
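As a rough illustration of the three scoring modes mentioned above, the sketch below shows how a single judgment could be represented and validated. The names (Judgment, ScoreType, validate) are hypothetical and do not reflect the skill's actual API.

```python
# Minimal sketch of binary / categorical / continuous scoring for one trace.
# All names here are illustrative assumptions, not the skill's real interface.
from dataclasses import dataclass
from enum import Enum
from typing import Union


class ScoreType(Enum):
    BINARY = "binary"            # pass / fail
    CATEGORICAL = "categorical"  # e.g. "correct", "hallucination", "formatting"
    CONTINUOUS = "continuous"    # e.g. a 0.0 - 1.0 quality score


@dataclass
class Judgment:
    trace_id: str
    score_type: ScoreType
    score: Union[bool, str, float]
    rationale: str  # free-text note recorded alongside the score


def validate(judgment: Judgment) -> None:
    """Reject scores whose Python type does not match the declared score type."""
    expected = {
        ScoreType.BINARY: bool,
        ScoreType.CATEGORICAL: str,
        ScoreType.CONTINUOUS: float,
    }[judgment.score_type]
    if not isinstance(judgment.score, expected):
        raise TypeError(f"{judgment.score_type.value} score must be {expected.__name__}")


# Example: a continuous judgment with its rationale kept for the audit trail.
validate(Judgment("trace-42", ScoreType.CONTINUOUS, 0.8, "Accurate but verbose"))
```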
Key Features
01 Context-aware judgment options derived from evaluation metadata
02 Support for detailed audit trails with judgment notes and rationale
03 Automated promotion of reviewed traces to curated datasets (see the sketch after this list)
04 Multi-run queue triage and status-based grouping
05 Interactive Q&A protocol for structured human judgment
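The sketch below illustrates status-based grouping and promotion of accepted reviews into a JSONL dataset. The record fields and the promote_reviewed helper are assumptions for illustration, not the skill's real implementation.

```python
# Illustrative sketch: group reviews by status, then append accepted ones to a
# curated JSONL dataset while keeping the rationale as an audit trail.
import json
from collections import defaultdict
from pathlib import Path


def promote_reviewed(reviews: list[dict], dataset_path: Path) -> dict[str, int]:
    """Assumes each review dict carries: trace_id, status ("accepted" /
    "rejected" / "pending"), output, and a free-text rationale.
    Returns per-status counts, useful for triaging what remains in the queue."""
    by_status: dict[str, list[dict]] = defaultdict(list)
    for review in reviews:
        by_status[review["status"]].append(review)

    with dataset_path.open("a", encoding="utf-8") as out:
        for review in by_status.get("accepted", []):
            out.write(json.dumps({
                "trace_id": review["trace_id"],
                "output": review["output"],
                "rationale": review["rationale"],
            }) + "\n")

    return {status: len(items) for status, items in by_status.items()}
```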
Use Cases
01 Building gold-standard datasets from successful production traces
02 Conducting manual error analysis on flagged model evaluation runs
03 Triaging and scoring large review queues across multiple agent projects