Automates the migration of GPU workloads and CUDA version upgrades on CoreWeave cloud infrastructure.
This skill streamlines the complex process of upgrading and migrating deployments within the CoreWeave ecosystem. It assists developers in transitioning between GPU types (such as moving from A100 to H100), updating CUDA driver versions, and managing Kubernetes manifest changes to ensure compatibility with CoreWeave's latest platform updates. By providing automated version detection, migration checklists, and rollback strategies, it minimizes downtime and prevents common scheduling failures associated with deprecated instance classes or resource quota mismatches.
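As a minimal sketch of what the automated version detection step might look like, the snippet below extracts a CUDA version from container image tags and flags images below a minimum. The image names, the regex, and the `MIN_CUDA` threshold are illustrative assumptions, not CoreWeave's actual detection logic or policy:

```python
import re

# Hypothetical minimum CUDA version assumed to be required by the
# target platform; the real requirement may differ.
MIN_CUDA = (12, 0)

def cuda_version_from_image(image: str) -> "tuple[int, int] | None":
    """Extract a CUDA version from an image tag such as
    'nvidia/cuda:11.8.0-runtime-ubuntu22.04'; None if not found."""
    match = re.search(r"cuda:?(\d+)\.(\d+)", image)
    if match:
        return int(match.group(1)), int(match.group(2))
    return None

def flag_outdated(images: "list[str]") -> "list[str]":
    """Return the images whose detected CUDA version is below MIN_CUDA."""
    outdated = []
    for image in images:
        version = cuda_version_from_image(image)
        if version is not None and version < MIN_CUDA:
            outdated.append(image)
    return outdated
```

For example, `flag_outdated(["nvidia/cuda:11.8.0-runtime", "nvidia/cuda:12.2.0-devel"])` returns only the 11.8 image.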
Key Features
1. Automated Kubernetes manifest updates for API and schema changes
2. Built-in rollback procedures for failed deployments using strategic merge patches
3. Seamless transition mapping for node selectors (e.g., A100 to H100 SXM5)
4. Automated detection of deprecated GPU instance types and CUDA versions
5. Comprehensive error handling for quota issues and driver version mismatches
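The node-selector transition mapping and rollback generation above can be sketched as follows. The label key `gpu.nvidia.com/class` and the class names in `GPU_CLASS_MAP` are assumptions for illustration; the actual labels and values exposed by CoreWeave may differ:

```python
import copy

# Illustrative mapping of deprecated node-selector values to their
# replacements; label key and class names are assumptions.
LABEL_KEY = "gpu.nvidia.com/class"
GPU_CLASS_MAP = {
    "A100_PCIE_80GB": "H100_PCIE",
    "A100_NVLINK_80GB": "H100_NVSWITCH",
}

def migrate_node_selector(pod_spec: dict) -> "tuple[dict, dict]":
    """Return (migrated_spec, rollback_patch).

    The rollback patch is a strategic-merge-style fragment that
    restores the original selector if the migrated deployment fails
    to schedule. The input spec is left unmodified."""
    migrated = copy.deepcopy(pod_spec)
    selector = migrated.get("nodeSelector", {})
    old_value = selector.get(LABEL_KEY)
    if old_value in GPU_CLASS_MAP:
        selector[LABEL_KEY] = GPU_CLASS_MAP[old_value]
        rollback = {"spec": {"nodeSelector": {LABEL_KEY: old_value}}}
        return migrated, rollback
    return migrated, {}
```

Returning the rollback patch alongside the migrated spec means the undo path is captured at migration time rather than reconstructed after a failure.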
Use Cases
1. Auditing existing CoreWeave namespaces for deprecated hardware classes before platform sunset dates
2. Upgrading legacy ML inference clusters from A100 to H100 GPUs for better performance
3. Updating container base images and Kubernetes manifests to support the latest CUDA drivers
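The namespace-audit use case can be sketched as a pass over a collection of workload specs, reporting any that still request a deprecated GPU class. The `deprecated_classes` values and the default label key are hypothetical placeholders:

```python
def audit_workloads(workloads: "dict[str, dict]",
                    deprecated_classes: "set[str]",
                    label_key: str = "gpu.nvidia.com/class") -> "dict[str, str]":
    """Return {workload_name: deprecated_class} for every workload
    whose node selector still targets a deprecated GPU class."""
    findings = {}
    for name, spec in workloads.items():
        gpu_class = spec.get("nodeSelector", {}).get(label_key)
        if gpu_class in deprecated_classes:
            findings[name] = gpu_class
    return findings
```

Running such an audit before a platform sunset date yields the list of deployments that need migration, rather than discovering them through scheduling failures afterward.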