How does this skill prevent data leakage in R?

It scans your code for common leakage pitfalls, such as applying transformations to the full dataset before the initial split or passing test data to the recipe's prep() function.
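As a rough illustration of the safe pattern this answer refers to (a sketch, not the skill's own detection logic), the R snippet below splits first and then estimates normalization statistics from the training rows only. The built-in mtcars data, the 80/20 proportion, and the seed value are assumptions chosen for the example.

```r
library(rsample)
library(recipes)

set.seed(123)  # illustrative seed so the split is reproducible

# Split BEFORE any preprocessing.
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test  <- testing(split)

# Leaky pattern to avoid: scaling the full data set before splitting,
# which lets test-set means and standard deviations reach training.
# scaled_all <- scale(mtcars)

# Define the recipe on the training data only.
rec <- recipe(mpg ~ ., data = train) |>
  step_normalize(all_numeric_predictors())

# prep() estimates the means and standard deviations from `train`.
prepped <- prep(rec, training = train)

# bake() applies those training statistics; the test set is never
# used to estimate anything.
train_baked <- bake(prepped, new_data = NULL)
test_baked  <- bake(prepped, new_data = test)
```

After normalization the training predictors have mean zero, while the test predictors generally do not, which is expected because they were scaled with training statistics.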

Is this skill based on a specific R standard?

The patterns are derived from 'Tidy Modeling with R' (TMwR) by Max Kuhn and Julia Silge, the standard reference for the Tidymodels ecosystem.

Why does the skill suggest using Workflows instead of manual prep?

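The 'workflows' package in R bundles the recipe and the model into one object. Fitting that object estimates the preprocessing statistics from the training data, and predicting applies those same statistics to new data, which significantly reduces the risk of manual errors and leakage.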
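A minimal sketch of that idea, assuming the built-in mtcars data and a plain linear regression as stand-ins for a real project:

```r
library(rsample)
library(recipes)
library(parsnip)
library(workflows)

set.seed(123)  # illustrative seed
split <- initial_split(mtcars, prop = 0.8)

rec <- recipe(mpg ~ ., data = training(split)) |>
  step_normalize(all_numeric_predictors())

# Bundle preprocessing and model so they are always fitted together.
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(linear_reg())  # linear_reg() uses the "lm" engine by default

# fit() preps the recipe on the training data, then fits the model.
wf_fit <- fit(wf, data = training(split))

# predict() bakes the new data with the training statistics first.
preds <- predict(wf_fit, new_data = testing(split))
```

There is no separate prep() or bake() call to get wrong here, which is the main reason the manual route is more error-prone.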

Does it support imbalanced dataset best practices?

Yes, it detects when stratified sampling is missing in cross-validation or initial splits, which is critical for maintaining class proportions.
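For illustration, stratification in rsample might look like the sketch below. The data frame, its 15% positive rate, and the column names are hypothetical, made up only to show the `strata` argument.

```r
library(rsample)

set.seed(123)  # illustrative seed

# Hypothetical imbalanced data: 75 positives out of 500 rows (15%).
dat <- data.frame(
  x     = rnorm(500),
  class = factor(c(rep("pos", 75), rep("neg", 425)))
)

# Stratify the initial split so both partitions keep the class ratio.
split <- initial_split(dat, prop = 0.8, strata = class)

# Stratify the resamples as well, using the training data only.
folds <- vfold_cv(training(split), v = 5, strata = class)

# Share of the positive class in the training partition.
train_pos_rate <- mean(training(split)$class == "pos")
```

Without `strata`, a random split of a small or heavily imbalanced dataset can leave a fold with very few minority-class rows, which makes the resampled metrics unstable.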

Tidymodels Review Patterns

Name: Tidymodels Review Patterns
Author: choxos


Data Science & ML

Automates code review for R Tidymodels workflows to prevent data leakage and ensure statistical best practices.

This skill acts as a specialized auditor for R data science projects, focusing on the Tidymodels ecosystem and 'Tidy Modeling with R' (TMwR) principles. It systematically scans R scripts for critical anti-patterns such as data leakage, improper resampling, and workflow mismanagement. By identifying issues like preprocessing before splitting or missing stratification in imbalanced datasets, it helps data scientists build more robust, reproducible, and statistically valid machine learning models while reducing the risk of overly optimistic performance estimates.

Key Features

01 Identifies resampling violations including missing stratification for imbalanced data

02 0 GitHub stars

03 Enforces 'workflow' object usage to automate safe preprocessing and fitting

04 Detects critical data leakage patterns like prepping recipes on test data

05 Validates evaluation logic to prevent testing on training data

06 Checks for reproducibility by flagging missing random seeds in stochastic operations
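The last two checks in the list above (held-out evaluation and random seeds) can be sketched as follows. This is an assumed example of code that would pass such a review, not output from the skill; the mtcars data, the formula, and the seed value are placeholders.

```r
library(rsample)
library(parsnip)
library(workflows)
library(tune)

# Set a seed before any stochastic step (here, the random split).
set.seed(2024)
split <- initial_split(mtcars, prop = 0.8)

wf <- workflow() |>
  add_formula(mpg ~ wt + hp) |>
  add_model(linear_reg())

# last_fit() fits on the training portion of the split and computes
# metrics on the held-out test portion only.
final_res    <- last_fit(wf, split)
test_metrics <- collect_metrics(final_res)
```

Computing metrics with predictions made on the training rows instead would give an optimistic estimate, which is the error the evaluation check is meant to catch.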

Use Cases

01 Peer-reviewing data science scripts to ensure statistical validity and no data leakage

02 Auditing complex machine learning pipelines for imbalanced classification tasks

03 Onboarding developers to the Tidymodels ecosystem using TMwR best practices
