How does this skill help with Spark memory errors?

It provides standardized templates for tuning executor memory, memoryOverhead, and memory fractions to prevent common JVM and off-heap OutOfMemory (OOM) issues.
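As a rough illustration of the kind of template this answer describes, the sketch below splits a per-executor container budget into JVM heap and off-heap overhead. The helper name `executor_memory_conf` and the 8 GiB budget are hypothetical. The 10% fraction and 384 MiB floor mirror Spark's documented defaults for `spark.executor.memoryOverhead`, and the two fraction values are Spark's defaults rather than values taken from the skill.

```python
def executor_memory_conf(container_mb, overhead_fraction=0.10, min_overhead_mb=384):
    """Split a container budget (MiB) into JVM heap and off-heap overhead.

    Assumption: the overhead is carved out of the container budget, which is
    a simplification of how Spark derives it from the heap size.
    """
    overhead_mb = max(min_overhead_mb, int(container_mb * overhead_fraction))
    heap_mb = container_mb - overhead_mb
    if heap_mb <= 0:
        raise ValueError("container budget is too small to leave room for the heap")
    return {
        "spark.executor.memory": f"{heap_mb}m",
        "spark.executor.memoryOverhead": f"{overhead_mb}m",
        # Share of the heap used for execution and storage (Spark default).
        "spark.memory.fraction": "0.6",
        # Share of that region protected from eviction for cached data (Spark default).
        "spark.memory.storageFraction": "0.5",
    }


# Hypothetical 8 GiB container per executor.
conf = executor_memory_conf(8192)
```

Raising `spark.executor.memoryOverhead` is usually the first thing to try when containers are killed for exceeding memory limits, while heap OOMs point at `spark.executor.memory` or the memory fractions.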

Which join types are covered in these patterns?

The skill provides guidance and code for Broadcast joins, Sort-Merge joins, and Bucket joins, so you can pick the strategy that best fits the sizes of the datasets being joined.
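One way to picture the size-based choice is the heuristic below. It is only a sketch: `choose_join_strategy` is a hypothetical helper, not part of the skill or of Spark, and the only Spark-specific fact it relies on is that `spark.sql.autoBroadcastJoinThreshold` defaults to 10 MiB.

```python
# Default value of spark.sql.autoBroadcastJoinThreshold (10 MiB).
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024


def choose_join_strategy(smaller_side_bytes, both_bucketed_on_key=False,
                         broadcast_threshold=DEFAULT_BROADCAST_THRESHOLD):
    """Pick a join strategy from the size of the smaller input (heuristic)."""
    if smaller_side_bytes <= broadcast_threshold:
        # Small enough to ship to every executor and avoid a shuffle.
        return "broadcast"
    if both_bucketed_on_key:
        # Both tables are pre-bucketed on the join key, so the shuffle can be skipped.
        return "bucket"
    # General fallback for two large inputs.
    return "sort_merge"


choices = [
    choose_join_strategy(5 * 1024 * 1024),
    choose_join_strategy(50 * 1024**3, both_bucketed_on_key=True),
    choose_join_strategy(50 * 1024**3),
]
```

In PySpark, a broadcast join can be requested explicitly with `large_df.join(broadcast(small_df), "key")` using `pyspark.sql.functions.broadcast`, and bucketed tables are written with `df.write.bucketBy(n, "key").saveAsTable(...)`. The DataFrame names here are placeholders.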

How do I apply these configurations to my existing project?

The skill offers a modular configuration cheat sheet and reusable PySpark snippets that can be integrated directly into your SparkSession builder or spark-submit scripts.
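A minimal sketch of both integration paths, assuming the configuration lives in a plain dictionary. The helper names, the two sample settings, and `job.py` are hypothetical. `apply_to_builder` relies only on the `config(key, value)` method that `SparkSession.builder` exposes.

```python
import shlex


def to_spark_submit_args(conf):
    """Turn a config dict into repeated `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


def apply_to_builder(builder, conf):
    """Apply each setting to a SparkSession builder (or any object with .config)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Hypothetical subset of a configuration cheat sheet.
cheat_sheet = {
    "spark.sql.adaptive.enabled": "true",
    "spark.executor.memory": "8g",
}

command = shlex.join(["spark-submit", *to_spark_submit_args(cheat_sheet), "job.py"])
```

With PySpark installed, the same dictionary could be applied in code as `apply_to_builder(SparkSession.builder.appName("job"), cheat_sheet).getOrCreate()`.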

Can it handle data skew in large joins?

Yes, it includes patterns for manual salting and for configuring Adaptive Query Execution (AQE) to detect skewed keys and spread their rows evenly across tasks.
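The two approaches can be sketched as follows. The AQE keys are real Spark 3.x settings shown with their default thresholds. The salting part is a plain-Python model of the idea (lists of key/value pairs instead of DataFrames), with hypothetical helper names, meant only to show that salting changes how rows are grouped without changing the join result.

```python
import random
from collections import defaultdict

# AQE skew-join settings (Spark 3.x); factor and threshold are Spark's defaults.
AQE_SKEW_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.skewJoin.skewedPartitionFactor": "5",
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "256m",
}


def salt_large_side(rows, num_salts, seed=0):
    """Append a random salt to each key so a hot key spreads over several groups."""
    rng = random.Random(seed)
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def explode_small_side(rows, num_salts):
    """Replicate each small-side row once per salt so every salted key finds a match."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def hash_join(left, right):
    """Inner join of two lists of (key, value) pairs."""
    lookup = defaultdict(list)
    for key, value in right:
        lookup[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in lookup.get(key, [])]


# Hypothetical skewed input: one hot key with 100 rows, one cold key with 1 row.
large = [("hot", i) for i in range(100)] + [("cold", 100)]
small = [("hot", "H"), ("cold", "C")]

plain_result = hash_join(large, small)
salted = hash_join(salt_large_side(large, 8), explode_small_side(small, 8))
# Drop the salt again so the output has the original key.
salted_result = [(key, lv, rv) for (key, _salt), lv, rv in salted]
```

The trade-off is that the small side grows by a factor of `num_salts`, so manual salting is usually reserved for cases where AQE's automatic skew handling is unavailable or not enough.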

Does it support Delta Lake optimizations?

Yes, it includes optimization patterns specifically for Delta Lake environments, such as Z-Ordering, bin-packing, and auto-compaction settings.
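For orientation, the statements involved might be generated like this. The helpers and the `events` table with its columns are hypothetical. `OPTIMIZE ... ZORDER BY` is Delta Lake SQL (plain `OPTIMIZE` performs bin-packing compaction), and the two `delta.autoOptimize.*` table properties are assumed to be available, which depends on the Delta Lake or Databricks runtime version in use.

```python
def optimize_statement(table, zorder_columns=()):
    """Build an OPTIMIZE statement; without columns it only bin-packs small files."""
    sql = f"OPTIMIZE {table}"
    if zorder_columns:
        sql += " ZORDER BY (" + ", ".join(zorder_columns) + ")"
    return sql


def auto_compaction_statement(table):
    """Build an ALTER TABLE statement enabling optimized writes and auto-compaction.

    Assumption: the runtime supports these table properties.
    """
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "'delta.autoOptimize.optimizeWrite' = 'true', "
        "'delta.autoOptimize.autoCompact' = 'true')"
    )


# Hypothetical table and columns.
zorder_sql = optimize_statement("events", ["user_id", "event_date"])
binpack_sql = optimize_statement("events")
props_sql = auto_compaction_statement("events")
```

Each string can then be passed to `spark.sql(...)` on a session that has Delta Lake configured.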

Apache Spark Optimization

Name: Apache Spark Optimization
Author: wshobson

Stars: 23,194
Category: Data Science & Machine Learning

Optimizes Apache Spark jobs through advanced partitioning, memory management, and shuffle tuning strategies.

This skill provides production-grade patterns for improving Apache Spark job performance, helping developers build scalable data pipelines. It enables Claude to implement partitioning strategies, configure fine-grained executor memory settings, and debug performance bottlenecks such as data skew or excessive shuffling. By applying industry-standard techniques such as Adaptive Query Execution (AQE), broadcast joins, and efficient serialization, it helps keep big data processing cost-effective and performant even at multi-terabyte scale.
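To make the partitioning and serialization points concrete, here is a small sketch that sizes the shuffle partition count from an estimated shuffle volume. The helper name and the 128 MiB target per partition are assumptions (a common rule of thumb, not a value stated by the skill). The configuration keys and the Kryo serializer class are standard Spark settings.

```python
import math


def shuffle_tuning_conf(shuffle_bytes, target_partition_bytes=128 * 1024**2):
    """Derive shuffle settings from an estimated shuffle size in bytes."""
    partitions = max(1, math.ceil(shuffle_bytes / target_partition_bytes))
    return {
        "spark.sql.shuffle.partitions": str(partitions),
        # Let AQE merge partitions that turn out smaller than expected.
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        # Kryo is typically faster and more compact than Java serialization.
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    }


# Hypothetical job that shuffles about 1 TiB.
conf = shuffle_tuning_conf(1024**4)
```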

Key Features

01 Executor memory and storage fraction tuning
02 Join optimization (Broadcast, Bucket, and Sort-Merge)
03 Adaptive Query Execution (AQE) configuration
04 Advanced partitioning and data skew salting
05 23,194 GitHub stars
06 Efficient data format and serialization settings

Use Cases

01 Scaling data pipelines to handle multi-terabyte datasets efficiently
02 Reducing cloud infrastructure costs by minimizing Spark job execution time
03 Resolving OutOfMemory (OOM) errors and executor failures in production


Install with 🐟 Skill.Fish

npx skillfish add wshobson/agents spark-optimization

For use in Claude.ai and ChatGPT
