Evaluates and remediates tabular data quality issues using the high-performance qsv toolkit.
This skill provides a comprehensive framework for assessing the integrity of CSV and other tabular datasets by defining key quality dimensions such as completeness, uniqueness, and validity. It offers a structured remediation decision tree that helps developers fix common data issues (ragged rows, encoding errors, duplicate entries) in an order that prevents one fix from masking or reintroducing another. Whether you are preparing data for machine learning or cleaning legacy exports, the skill streamlines the detection of outliers and inconsistencies using statistical metrics such as kurtosis, Gini coefficients, and Shannon entropy.
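The distribution metrics named above are standard statistics; as a quick illustration independent of qsv (this is not qsv's implementation), a minimal Python sketch of Shannon entropy and the Gini coefficient:

```python
from collections import Counter
from math import log2

def shannon_entropy(values: list[str]) -> float:
    """Entropy in bits: 0 for a constant column, log2(n) for n uniformly distinct values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gini(values: list[float]) -> float:
    """Gini coefficient of a non-negative numeric column: 0 = perfectly even, near 1 = concentrated."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    # Standard sorted-rank formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n

print(shannon_entropy(["a", "a", "b", "b"]))  # 1.0 bit: two equally likely values
print(gini([1.0, 1.0, 1.0, 1.0]))             # 0.0: perfect equality
```

A low-entropy or high-Gini column is a cheap signal that a field is nearly constant or dominated by a few values, which often indicates an export defect rather than real variation.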
Key Features
1. Automated fix ordering to prevent cascading errors during data cleaning (see the pipeline sketch after this list)
2. Comprehensive data quality dimension reference for completeness, validity, and accuracy
3. Remediation decision tree for fixing structural and content-based data issues
4. Advanced statistical profiling including outlier detection and distribution shape analysis
5. Safety checks for malicious payloads and injection patterns in tabular data
Use Cases
1. Auditing legacy database exports for encoding issues, ragged rows, and duplicate records
2. Pre-processing large CSV datasets for machine learning pipelines to ensure model accuracy (see the profiling sketch after this list)
3. Validating referential integrity and schema conformity during complex data migrations
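For the machine-learning use case, a hedged sketch of a completeness check built on `qsv stats`; the `--nullcount` flag and the exact output column names are assumptions that may vary across qsv versions, so check `qsv stats --help` for yours:

```python
import csv
import io
import subprocess

def null_counts(path: str) -> dict[str, int]:
    # Run qsv stats; --nullcount is assumed here to add a per-column
    # null count to the output (verify against your qsv version).
    out = subprocess.run(
        ["qsv", "stats", "--nullcount", path],
        check=True, capture_output=True, text=True,
    ).stdout
    # qsv stats emits one CSV row per column of the input file.
    return {row["field"]: int(row["nullcount"])
            for row in csv.DictReader(io.StringIO(out))}

if __name__ == "__main__":
    # Flag columns with missing values before they reach the model pipeline.
    for field, nulls in null_counts("export.csv").items():  # hypothetical file
        if nulls:
            print(f"{field}: {nulls} missing values")
```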