About
This skill gives Claude specialized knowledge of PySpark fundamentals so that it can generate, debug, and optimize distributed data processing scripts. It provides implementation patterns for initializing a SparkSession, handling data formats such as Parquet and Delta Lake, and performing complex transformations with the DataFrame API and window functions. It also applies performance-tuning best practices, such as broadcasting and predicate pushdown, to help data engineers and data scientists build scalable, production-ready data pipelines.