Can I use this for RLVR (Reinforcement Learning from Verifiable Rewards)?

Yes, the skill provides specific functions to assess reasoning trace verifiability, requiring 90%+ verifiability for math and coding domains to ensure training signal quality.
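
As a rough illustration of the 90% gate, the sketch below computes the share of reasoning traces that can be checked automatically and compares it with the threshold. The function names, the trace fields (`final_answer`, `reference`), and the simple checker are assumptions made for this example, not the skill's actual API.

```python
from typing import Callable, Iterable, Mapping

Trace = Mapping[str, str]


def has_checkable_answer(trace: Trace) -> bool:
    """Hypothetical checker: a trace counts as verifiable when it exposes a
    non-empty final answer and a non-empty reference to compare against."""
    return bool(trace.get("final_answer", "").strip()) and bool(
        trace.get("reference", "").strip()
    )


def verifiability_rate(
    traces: Iterable[Trace], is_verifiable: Callable[[Trace], bool]
) -> float:
    """Return the fraction of traces accepted by the checker (0.0 if empty)."""
    items = list(traces)
    if not items:
        return 0.0
    return sum(1 for t in items if is_verifiable(t)) / len(items)


def passes_rlvr_gate(
    traces: Iterable[Trace],
    is_verifiable: Callable[[Trace], bool] = has_checkable_answer,
    threshold: float = 0.90,  # the 90% figure quoted above
) -> bool:
    """True when the share of verifiable traces meets the threshold."""
    return verifiability_rate(traces, is_verifiable) >= threshold
```

A dataset of math or coding traces would then be accepted for RLVR only when `passes_rlvr_gate(traces)` returns `True`; a real checker would likely run unit tests or a symbolic comparison instead of the field check used here.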

How does the skill prevent length bias in DPO?

It includes a mandatory length-bias audit that flags a dataset when the longer response is the chosen one in more than 70% of preference pairs, so the model does not learn a 'longer = better' shortcut.
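
A minimal sketch of such an audit, assuming each preference pair is a mapping with `chosen` and `rejected` strings and that length is measured in characters (a real audit might count tokens instead). The function names and the returned fields are hypothetical, not taken from the skill.

```python
from typing import Dict, Mapping, Sequence, Union


def longer_chosen_rate(pairs: Sequence[Mapping[str, str]]) -> float:
    """Fraction of pairs in which the chosen response is strictly longer
    than the rejected one (0.0 for an empty dataset)."""
    if not pairs:
        return 0.0
    longer = sum(1 for p in pairs if len(p["chosen"]) > len(p["rejected"]))
    return longer / len(pairs)


def audit_length_bias(
    pairs: Sequence[Mapping[str, str]],
    max_rate: float = 0.70,  # the 70% ceiling quoted above
) -> Dict[str, Union[float, bool]]:
    """Report the longer-chosen rate and whether it stays within the ceiling."""
    rate = longer_chosen_rate(pairs)
    return {"longer_chosen_rate": rate, "passed": rate <= max_rate}
```

If the audit fails, one option is to rebalance the dataset by sampling more pairs in which the shorter response was preferred.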

What is IFD scoring in this skill?

Instruction-Following Difficulty (IFD) compares how hard a response is for a model to predict when the instruction is given versus when it is not, expressed as a perplexity-style ratio. A higher score indicates a more challenging example, which is generally more useful for training.
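
One common formulation of IFD divides the model's average loss on the response given the instruction by its average loss on the response alone. The sketch below assumes the per-token log-probabilities have already been obtained from some language model; the function names are hypothetical, and this page does not confirm that the skill computes IFD in exactly this way.

```python
from typing import Sequence


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood over the response tokens."""
    if not token_logprobs:
        raise ValueError("at least one token log-probability is required")
    return -sum(token_logprobs) / len(token_logprobs)


def ifd_score(
    conditioned_logprobs: Sequence[float],
    unconditioned_logprobs: Sequence[float],
) -> float:
    """Ratio of the response loss with the instruction to the response loss
    without it. Values near 1 mean the instruction barely helps the model
    predict the response, i.e. the example is hard to follow."""
    unconditioned = mean_nll(unconditioned_logprobs)
    if unconditioned == 0:
        raise ValueError("unconditioned loss is zero; the ratio is undefined")
    return mean_nll(conditioned_logprobs) / unconditioned
```

For example, a response whose average loss drops from 1.0 without the instruction to 0.5 with it would score 0.5 under this formulation.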

What are the recommended quality thresholds for SFT?

For standard SFT (Supervised Fine-Tuning), a quality score of ≥8.0 and an IFD score of ≥0.3 are recommended to ensure the model learns from high-value examples.
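
Applied as a filter, these thresholds could look like the sketch below. It assumes each record already carries precomputed `quality` and `ifd` fields; those field names and the function name are illustrative assumptions, not the skill's documented schema.

```python
from typing import Any, Dict, Iterable, List


def filter_sft_records(
    records: Iterable[Dict[str, Any]],
    min_quality: float = 8.0,  # recommended quality threshold
    min_ifd: float = 0.3,      # recommended IFD threshold
) -> List[Dict[str, Any]]:
    """Keep only records that meet both thresholds."""
    return [
        r
        for r in records
        if r["quality"] >= min_quality and r["ifd"] >= min_ifd
    ]
```

Both conditions must hold, so a well-written but trivially easy example (high quality, low IFD) is dropped along with a difficult but poorly written one.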

Quality Scoring & Data Evaluation

Name: Quality Scoring & Data Evaluation
Author: akaszubski

Data Science & ML

Evaluates training data quality using multi-dimensional metrics to ensure high-performance LLM fine-tuning and alignment.

This skill provides a comprehensive framework for assessing the quality of datasets used in SFT, DPO, and RLVR training pipelines. By leveraging six specialized scorers—ranging from the fast Instruction-Following Difficulty (IFD) metric to sophisticated multi-model ensembles—it validates factuality, reasoning logic, and domain relevance. It is particularly useful for preventing common training pitfalls like length bias in DPO pairs and ensuring that only verifiable reasoning traces are used for reinforcement learning, ultimately leading to more robust and reliable AI models.

Key Features

01 11 GitHub stars

02 Multi-dimensional scoring across 6 metrics including IFD, factuality, and reasoning.

03 Automated DPO pair validation with length-bias auditing to prevent model shortcuts.

04 Six specialized scorer types ranging from FastIFD to high-fidelity ensemble models.

05 RLVR verifiability checks specifically designed for math and coding domains.

06 High-performance CLI support for distributed processing on Apple Silicon and other backends.

Use Cases

01 Filtering SFT datasets to retain only high-difficulty, high-quality instruction pairs.

02 Validating DPO preference pairs to ensure a significant quality margin and eliminate length bias.

03 Assessing reasoning traces for RLVR training to guarantee logical and factual verifiability.

Install with 🐟 Skill.Fish

npx skillfish add akaszubski/autonomous-dev quality-scoring
