About
This skill provides a specialized framework for configuring Reinforcement Learning (RL) training sessions, specifically PPO, to maximize throughput on high-end NVIDIA GPUs. It prevents common performance bottlenecks, such as under-utilizing A100/H100 cores or leaving compilation disabled, by applying a layered configuration strategy that detects the hardware tier before selecting a training mode. By choosing appropriate parallel-environment counts and minibatch sizes and enabling `torch.compile`, it helps developers achieve 10x or greater speedups in training frames per second (FPS).
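The tier-detection strategy described above can be sketched as a small lookup step that runs before training starts. This is a minimal illustrative example, not the skill's actual implementation: the function name `select_ppo_config`, the tier table, and all environment/minibatch counts are assumptions chosen for demonstration.

```python
from dataclasses import dataclass

@dataclass
class PPOPerfConfig:
    num_envs: int        # parallel environments to run
    minibatch_size: int  # per-update minibatch size
    use_compile: bool    # whether to enable torch.compile

# Hypothetical tier table: high-end datacenter GPUs get more parallel
# environments and larger minibatches; unknown hardware falls back to
# conservative defaults. Values here are illustrative only.
_TIERS = {
    "H100": PPOPerfConfig(num_envs=4096, minibatch_size=32768, use_compile=True),
    "A100": PPOPerfConfig(num_envs=2048, minibatch_size=16384, use_compile=True),
}

def select_ppo_config(device_name: str) -> PPOPerfConfig:
    """Pick a PPO performance config based on the detected GPU name."""
    for tier, cfg in _TIERS.items():
        if tier in device_name:
            return cfg
    # Fallback for consumer or unrecognized GPUs: fewer environments,
    # smaller minibatches, and compilation left off.
    return PPOPerfConfig(num_envs=256, minibatch_size=2048, use_compile=False)

# In a real training script the device name would typically come from
# torch.cuda.get_device_name(0) when CUDA is available.
print(select_ppo_config("NVIDIA H100 80GB HBM3"))
print(select_ppo_config("NVIDIA GeForce RTX 3060"))
```

Keeping the detection step as a pure function of the device-name string makes the tier logic easy to test without GPU hardware present.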