Can I add support for new research data types?

Yes, you can use the /haipipe-data-1-source design-chef command to create new SourceFn transformation logic using the provided builder scripts.

Does this skill modify my original raw data files?

No, it reads raw files and outputs processed Parquet files into a dedicated workspace folder, keeping your original source files intact for reproducibility.
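The read-only guarantee described above can be checked independently. The following is a minimal sketch, not the skill's actual implementation: the function names `sha256_of` and `process_to_workspace`, the file names, and the folder layout are all hypothetical, and the output is written as CSV (rather than Parquet) only so the example stays within the Python standard library.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def process_to_workspace(raw_path: Path, workspace: Path) -> Path:
    """Read a raw CSV and write a cleaned copy into a separate workspace folder.

    Hypothetical helper: the raw file is only opened for reading.
    """
    workspace.mkdir(parents=True, exist_ok=True)
    out_path = workspace / f"{raw_path.stem}_processed.csv"
    with raw_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        # Normalize header names: trim whitespace and lowercase.
        fieldnames = [name.strip().lower() for name in reader.fieldnames]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow({k.strip().lower(): v.strip() for k, v in row.items()})
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw" / "glucose.csv"
    raw.parent.mkdir()
    raw.write_text("PatientID, Glucose\n1, 110\n2, 98\n")

    before = sha256_of(raw)
    out = process_to_workspace(raw, Path(tmp) / "workspace")

    # The raw file's checksum should be identical before and after processing.
    raw_unchanged = sha256_of(raw) == before
    processed_header = out.read_text().splitlines()[0]
    print(raw_unchanged, processed_header)
```

Comparing checksums before and after a run is one simple way to confirm that a processing step left the source files untouched.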

How do I verify that my workspace environment is correctly configured?

Before running subcommands, ensure you activate your virtual environment and run 'source env.sh' to load the required workspace paths and environment variables.

What is a SourceSet in the HAIPipe framework?

A SourceSet is a dictionary of pandas DataFrames mapped to specific table names, serving as the standardized output of the Layer 1 pipeline.
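As a rough illustration of that shape, a SourceSet could look like the sketch below. The table names (`Patient`, `CGM`) and column names are hypothetical placeholders chosen for this example; the actual tables and columns HAIPipe produces are not specified here.

```python
import pandas as pd

# Hypothetical table and column names, chosen only for illustration.
source_set = {
    "Patient": pd.DataFrame(
        {"PatientID": [1, 2], "BirthYear": [1980, 1975]}
    ),
    "CGM": pd.DataFrame(
        {
            "PatientID": [1, 1, 2],
            "Timestamp": pd.to_datetime(
                ["2024-01-01 08:00", "2024-01-01 08:05", "2024-01-01 08:00"]
            ),
            "GlucoseMgDl": [110, 115, 98],
        }
    ),
}

# Each key is a table name; each value is an ordinary pandas DataFrame.
for table_name, df in source_set.items():
    print(table_name, df.shape, list(df.columns))
```

Because the container is a plain dictionary, downstream code can look up a table by name and use the usual DataFrame operations on it.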

Why is schema consistency important in Layer 1?

Consistency ensures that Layer 2 (Record) can process data identically regardless of the original dataset source, enabling robust cross-study analysis and ML training.
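One way to picture this is a check that every dataset exposes the same tables and columns before it moves downstream. The sketch below is an assumption-laden illustration, not HAIPipe's own validation: `EXPECTED_SCHEMA`, `find_schema_problems`, and all table and column names are hypothetical.

```python
from typing import Dict, List, Mapping, Sequence

# Hypothetical expected schema: table name -> required column names.
EXPECTED_SCHEMA: Dict[str, List[str]] = {
    "Patient": ["PatientID", "BirthYear"],
    "CGM": ["PatientID", "Timestamp", "GlucoseMgDl"],
}


def find_schema_problems(
    tables: Mapping[str, Sequence[str]],
    expected: Mapping[str, Sequence[str]] = EXPECTED_SCHEMA,
) -> List[str]:
    """Return human-readable problems; an empty list means the schema matches."""
    problems: List[str] = []
    for table_name, required in expected.items():
        if table_name not in tables:
            problems.append(f"missing table: {table_name}")
            continue
        missing = [col for col in required if col not in tables[table_name]]
        if missing:
            problems.append(f"{table_name}: missing columns {missing}")
    return problems


# Two hypothetical studies, described by their table -> column listings.
study_a = {
    "Patient": ["PatientID", "BirthYear"],
    "CGM": ["PatientID", "Timestamp", "GlucoseMgDl"],
}
study_b = {"Patient": ["PatientID"]}

print(find_schema_problems(study_a))
print(find_schema_problems(study_b))
```

For a dictionary of DataFrames, the column listing for each table could be obtained with `list(df.columns)`. A dataset that passes such a check can be handed to the same downstream code as any other, which is the practical benefit of a consistent schema.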

HAIPipe Source Data Processor

Author: jluo41


Data Science & ML

Standardizes raw academic and medical data files into structured SourceSet DataFrames for research pipelines.

This skill manages the first layer of the 6-layer HAIPipe research data pipeline. It transforms raw formats such as CSV, XML, and Parquet into standardized, domain-consistent table structures. Researchers can use it to automate data ingestion, maintain schema consistency across datasets (such as EHR, CGM, or genomic data), and generate new transformation logic using a builder-pattern 'chef' metaphor. By providing a structured interface for data loading and pipeline execution, it prepares raw research data for downstream record alignment and machine learning feature extraction.

Key Features

01 Enforces strict schema consistency for seamless cross-dataset integration

02 Converts raw CSV, XML, Parquet, and JSON into standardized SourceSet DataFrames

03 Provides interactive subcommands for pipeline orchestration and data inspection

04 Automates code generation for new data transformation functions via 'Chef' builders

05 Supports diverse domains including CGM, EHR, genomics, and wearable device data


Use Cases

01 Automating reproducible data preprocessing workflows for academic paper incubation

02 Standardizing messy wearable device logs into clean, temporally aligned records

03 Ingesting heterogeneous clinical trial data into a unified research format


Install with 🐟 Skill.Fish

npx skillfish add jluo41/research-skills haipipe-data-1-source
