Configuration for multi-node distributed training (NCCL, Torch Distributed)
Native Apptainer container management and interactive session support
Simplified job submission using sbatch and ssubmit wrappers
GPU resource monitoring for H100 and H200 clusters
Real-time log analysis and Slurm job troubleshooting