Evaluates and remediates tabular data quality issues using the high-performance qsv toolkit.
This skill provides a comprehensive framework for assessing the integrity of CSV and other tabular datasets by defining key quality dimensions such as completeness, uniqueness, and validity. It offers a structured remediation decision tree that helps developers fix common data issues (ragged rows, encoding errors, duplicate entries) in an order that prevents one fix from masking or reintroducing another. Whether you are preparing data for machine learning or cleaning legacy exports, the skill streamlines the detection of outliers and inconsistencies using statistical metrics such as kurtosis, Gini coefficients, and Shannon entropy.
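The distribution metrics named above are standard statistics; as a quick illustration independent of qsv (this is not qsv's implementation), a minimal Python sketch of Shannon entropy and the Gini coefficient:

```python
from collections import Counter
from math import log2

def shannon_entropy(values: list[str]) -> float:
    """Entropy in bits: 0 for a constant column, log2(n) for n uniformly distinct values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gini(values: list[float]) -> float:
    """Gini coefficient of a non-negative numeric column: 0 = perfectly even, near 1 = concentrated."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    # Standard sorted-rank formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n

print(shannon_entropy(["a", "a", "b", "b"]))  # 1.0 bit: two equally likely values
print(gini([1.0, 1.0, 1.0, 1.0]))             # 0.0: perfect equality
```

A low-entropy or high-Gini column is a cheap signal that a field is nearly constant or dominated by a few values, which often indicates an export defect rather than real variation.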
Key Features
1. Automated fix ordering to prevent cascading errors during data cleaning (see the pipeline sketch after this list)
2. Comprehensive data quality dimension reference for completeness, validity, and accuracy
3. Remediation decision tree for fixing structural and content-based data issues
4. Advanced statistical profiling including outlier detection and distribution shape analysis
5. Safety checks for malicious payloads and injection patterns in tabular data
Use Cases
1. Auditing legacy database exports for encoding issues, ragged rows, and duplicate records
2. Pre-processing large CSV datasets for machine learning pipelines to ensure model accuracy (see the profiling sketch after this list)
3. Validating referential integrity and schema conformity during complex data migrations
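For the machine-learning use case, a hedged sketch of a completeness check built on `qsv stats`; the `--nullcount` flag and the exact output column names are assumptions that may vary across qsv versions, so check `qsv stats --help` for yours:

```python
import csv
import io
import subprocess

def null_counts(path: str) -> dict[str, int]:
    # Run qsv stats; --nullcount is assumed here to add a per-column
    # null count to the output (verify against your qsv version).
    out = subprocess.run(
        ["qsv", "stats", "--nullcount", path],
        check=True, capture_output=True, text=True,
    ).stdout
    # qsv stats emits one CSV row per column of the input file.
    return {row["field"]: int(row["nullcount"])
            for row in csv.DictReader(io.StringIO(out))}

if __name__ == "__main__":
    # Flag columns with missing values before they reach the model pipeline.
    for field, nulls in null_counts("export.csv").items():  # hypothetical file
        if nulls:
            print(f"{field}: {nulls} missing values")
```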