Does the skill handle imbalanced data?

The skill follows best practices like stratification to ensure that class distributions are maintained across all split subsets.

Does it overwrite my original data?

No, the skill generates separate files (e.g., train.csv, test.csv) to preserve your original dataset while creating the necessary subsets.

Can I specify custom split percentages?

Yes, you can define any custom ratio, such as 70/15/15 or 80/20, and the skill will partition the data accordingly.

What libraries does this skill use to split data?

It typically generates code using standard Python data science libraries such as pandas and scikit-learn for reliable and efficient processing.

Dataset Splitter

Name: Dataset Splitter
Author: jeremylongshore

byjeremylongshore

•

884

データサイエンスとML

Partitions datasets into training, validation, and testing sets to streamline machine learning model development.

概要

The Dataset Splitter skill automates the critical data preparation phase of machine learning by dividing raw datasets into optimized subsets. It intelligently analyzes user-defined proportions, generates robust Python code using standard libraries like scikit-learn, and executes the partitioning while ensuring data integrity. By handling complex requirements such as stratification for imbalanced data and randomization to prevent bias, this skill allows developers to move rapidly from raw data collection to model training with confidence in their evaluation sets.

主な機能

Automated train-test-validation partitioning
Support for custom split ratios and percentages
Data stratification for imbalanced datasets
Seamless creation of training and testing CSV files
Automated Python code generation and execution
884 GitHub stars

ユースケース

Creating representative validation sets for hyperparameter optimization
Preparing a raw CSV dataset for initial machine learning model training
Partitioning large datasets for robust cross-validation and testing

概要

主な機能

Automated train-test-validation partitioning
Support for custom split ratios and percentages
Data stratification for imbalanced datasets
Seamless creation of training and testing CSV files
Automated Python code generation and execution
884 GitHub stars

ユースケース

Creating representative validation sets for hyperparameter optimization
Preparing a raw CSV dataset for initial machine learning model training
Partitioning large datasets for robust cross-validation and testing