How does this skill detect temporal leakage?

It analyzes whether features are derived from future events or aggregations that include the prediction period, ensuring values would be known at the moment of inference.

What statistical thresholds indicate data leakage?

This skill flags single features with an AUC or R-squared greater than 0.95, or instances where one feature dominates more than 50% of total model importance.

Can this skill help with train/test contamination?

Yes, it detects group leakage where the same entity appears in both sets or where preprocessing like scaling was incorrectly fit on the entire dataset.

What is target leakage in machine learning?

Target leakage occurs when information that won't be available at prediction time is included in the training data, leading to overly optimistic but non-generalizable results.

Does it provide code for automated detection?

Yes, it includes a Python script template using pandas and scikit-learn to programmatically flag high-risk features in your dataset.

Target Leakage Detection

Name: Target Leakage Detection
Author: andikarachman

byandikarachman

•

データサイエンスとML

Detects and prevents data leakage in machine learning feature sets to ensure model performance generalizes to production.

The Target Leakage Detection skill provides a comprehensive framework for identifying data leakage, which occurs when information from the target or the future inadvertently enters training data. It automates the detection of temporal inconsistencies, direct feature-to-target proxies, and statistical anomalies like suspiciously high AUC or R-squared values. By applying these rigorous validation checks during the experimentation phase, data scientists can avoid artificially inflated metrics and build robust models that maintain their predictive power in real-world deployment scenarios.

主な機能

01Statistical signal flagging for high AUC and R-squared values

02Temporal validity verification for feature-target timing

03Identification of direct target proxies and downstream transformations

049 GitHub stars

05Comprehensive remediation strategies for fixing identified leaks

06Cross-boundary group leakage detection for train/test splits

ユースケース

01Debugging unexpectedly high model performance metrics during development

02Validating feature sets before training supervised machine learning models

03Ensuring time-series data splits maintain temporal integrity for forecasting

主な機能

01Statistical signal flagging for high AUC and R-squared values

02Temporal validity verification for feature-target timing

03Identification of direct target proxies and downstream transformations

049 GitHub stars

05Comprehensive remediation strategies for fixing identified leaks

06Cross-boundary group leakage detection for train/test splits

ユースケース

01Debugging unexpectedly high model performance metrics during development

02Validating feature sets before training supervised machine learning models

03Ensuring time-series data splits maintain temporal integrity for forecasting