How do I submit a job using this skill?

You can use standard Slurm commands like sbatch or the recommended ssubmit wrapper, which simplifies parameters for GPUs, CPUs, memory, and time limits.
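As a rough illustration of the plain sbatch route, the sketch below writes a minimal single-GPU batch script and submits it only if sbatch is on the PATH. The job name, resource values, file name, and train.py are placeholders chosen for this example, not values taken from the skill, and the ssubmit wrapper's own options are not shown because they are not documented here.

```shell
# Write a minimal batch script. Every value below is an illustrative
# placeholder (assumption), not a cluster-specific recommendation.
cat > train_job.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=train-demo
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=02:00:00
#SBATCH --output=%x-%j.out

# train.py is a hypothetical entry point.
srun python train.py
EOF

# Submit only where Slurm is actually installed.
if command -v sbatch >/dev/null 2>&1; then
  sbatch train_job.sbatch
else
  echo "sbatch not found; wrote train_job.sbatch without submitting"
fi
```

The %x and %j patterns in the output file name expand to the job name and job ID, which keeps logs from different runs separate.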

Can I monitor my job's resource usage in real-time?

Yes, the skill provides links to specialized Grafana dashboards for cluster overviews, workload monitoring, and job-level resource consumption stats.

Which GPU architectures are supported?

The skill includes specific configurations for Oracle OKE clusters (NVIDIA H100 GPUs) and DigitalOcean DOKS clusters (NVIDIA H200 GPUs).

How does data access work across clusters?

Both clusters use JuiceFS for unified data access at /data0/ or /data/srp/, maintaining consistent directory structures and permissions across development and cluster environments.

What is the best way to run interactive GPU sessions?

Use the 'sapptainer' command provided by the skill to start an interactive job with specific CPU, memory, and GPU requirements within a chosen container image.
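The sapptainer command's own options are not listed here, so the sketch below shows only what an equivalent request might look like with plain srun and apptainer. It should not be read as sapptainer's interface. The image path and resource values are assumptions for illustration.

```shell
# Illustrative resource request (assumptions): 1 GPU, 8 CPU cores, 32 GB, 1 hour.
GPUS=1
CPUS=8
MEM=32G
TIME=01:00:00
IMAGE=./pytorch.sif   # placeholder container image path

# Launch an interactive shell inside the container on a GPU node.
# --pty attaches a terminal; --nv exposes the host NVIDIA driver to the container.
if command -v srun >/dev/null 2>&1; then
  srun --gres="gpu:$GPUS" --cpus-per-task="$CPUS" --mem="$MEM" --time="$TIME" --pty \
    apptainer shell --nv "$IMAGE"
else
  echo "srun not found; skipping interactive launch"
fi
```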

Slurm GPU Cluster Management

Name: Slurm GPU Cluster Management
Author: SerendipityOneInc

Data Science & Machine Learning

Manages GPU-accelerated workloads on Slurm clusters using Apptainer containers for training and inference.

This skill streamlines the workflow for developers interacting with SRP's Slurm clusters, specifically optimized for H100 and H200 GPU workloads. It provides comprehensive guidance for submitting jobs via sbatch or the ssubmit wrapper, managing Apptainer container environments, and configuring multi-node distributed training. By integrating cluster monitoring, unified JuiceFS data access patterns, and automated Feishu notifications, this skill helps users troubleshoot failures, monitor resource utilization, and maintain efficient high-performance computing operations directly within Claude Code.

Key Features

01 Configuration for multi-node distributed training (NCCL, Torch Distributed)

02 Native Apptainer container management and interactive session support

03 1 GitHub star

04 Simplified job submission using sbatch and the ssubmit wrapper

05 GPU resource monitoring for H100 and H200 clusters

06 Real-time log analysis and Slurm job troubleshooting
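For the multi-node distributed training item above, a batch script along the following lines is one common way to combine Slurm with torchrun. This is a sketch under stated assumptions: two nodes with eight GPUs each, port 29500 for rendezvous, and a hypothetical train.py. The skill's actual configuration may differ.

```shell
# Write a two-node torchrun batch script. Node count, GPUs per node,
# port, and train.py are all assumptions made for illustration.
cat > ddp_job.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=ddp-demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=32
#SBATCH --time=04:00:00

# The first node in the allocation acts as the rendezvous host.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# Verbose NCCL logging helps when diagnosing hangs or slow collectives.
export NCCL_DEBUG=INFO

# One torchrun launcher per node; each launcher starts 8 worker processes.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_id="$SLURM_JOB_ID" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:29500" \
  train.py
EOF

echo "wrote ddp_job.sbatch"
```

Using one Slurm task per node and letting torchrun start the per-GPU workers keeps the rank bookkeeping inside PyTorch rather than in the batch script.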

Use Cases

01 Managing batch inference jobs and data processing tasks at scale

02 Debugging Slurm job failures and optimizing resource allocation

03 Training and fine-tuning large-scale AI models on high-performance GPU clusters


Install with 🐟 Skill.Fish

npx skillfish add serendipityoneinc/srp-claude-code-marketplace slurm
