Automates the migration of GPU workloads and CUDA version upgrades on CoreWeave cloud infrastructure.
This skill streamlines the complex process of upgrading and migrating deployments within the CoreWeave ecosystem. It assists developers in transitioning between GPU types (such as moving from A100 to H100), updating CUDA driver versions, and managing Kubernetes manifest changes to ensure compatibility with CoreWeave's latest platform updates. By providing automated version detection, migration checklists, and rollback strategies, it minimizes downtime and prevents common scheduling failures associated with deprecated instance classes or resource quota mismatches.
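As a minimal sketch of what the automated version detection step might look like, the snippet below extracts a CUDA version from container image tags and flags images below a minimum. The image names, the regex, and the `MIN_CUDA` threshold are illustrative assumptions, not CoreWeave's actual detection logic or policy:

```python
import re

# Hypothetical minimum CUDA version assumed to be required by the
# target platform; the real requirement may differ.
MIN_CUDA = (12, 0)

def cuda_version_from_image(image: str) -> "tuple[int, int] | None":
    """Extract a CUDA version from an image tag such as
    'nvidia/cuda:11.8.0-runtime-ubuntu22.04'; None if not found."""
    match = re.search(r"cuda:?(\d+)\.(\d+)", image)
    if match:
        return int(match.group(1)), int(match.group(2))
    return None

def flag_outdated(images: "list[str]") -> "list[str]":
    """Return the images whose detected CUDA version is below MIN_CUDA."""
    outdated = []
    for image in images:
        version = cuda_version_from_image(image)
        if version is not None and version < MIN_CUDA:
            outdated.append(image)
    return outdated
```

For example, `flag_outdated(["nvidia/cuda:11.8.0-runtime", "nvidia/cuda:12.2.0-devel"])` returns only the 11.8 image.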
Key Features
1. Automated Kubernetes manifest updates for API and schema changes
2. Built-in rollback procedures for failed deployments using strategic merge patches
3. Seamless transition mapping for node selectors (e.g., A100 to H100 SXM5)
4. Automated detection of deprecated GPU instance types and CUDA versions
5. Comprehensive error handling for quota issues and driver version mismatches
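The node-selector transition mapping and rollback generation above can be sketched as follows. The label key `gpu.nvidia.com/class` and the class names in `GPU_CLASS_MAP` are assumptions for illustration; the actual labels and values exposed by CoreWeave may differ:

```python
import copy

# Illustrative mapping of deprecated node-selector values to their
# replacements; label key and class names are assumptions.
LABEL_KEY = "gpu.nvidia.com/class"
GPU_CLASS_MAP = {
    "A100_PCIE_80GB": "H100_PCIE",
    "A100_NVLINK_80GB": "H100_NVSWITCH",
}

def migrate_node_selector(pod_spec: dict) -> "tuple[dict, dict]":
    """Return (migrated_spec, rollback_patch).

    The rollback patch is a strategic-merge-style fragment that
    restores the original selector if the migrated deployment fails
    to schedule. The input spec is left unmodified."""
    migrated = copy.deepcopy(pod_spec)
    selector = migrated.get("nodeSelector", {})
    old_value = selector.get(LABEL_KEY)
    if old_value in GPU_CLASS_MAP:
        selector[LABEL_KEY] = GPU_CLASS_MAP[old_value]
        rollback = {"spec": {"nodeSelector": {LABEL_KEY: old_value}}}
        return migrated, rollback
    return migrated, {}
```

Returning the rollback patch alongside the migrated spec means the undo path is captured at migration time rather than reconstructed after a failure.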
Use Cases
1. Auditing existing CoreWeave namespaces for deprecated hardware classes before platform sunset dates
2. Upgrading legacy ML inference clusters from A100 to H100 GPUs for better performance
3. Updating container base images and Kubernetes manifests to support the latest CUDA drivers
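The namespace-audit use case can be sketched as a pass over a collection of workload specs, reporting any that still request a deprecated GPU class. The `deprecated_classes` values and the default label key are hypothetical placeholders:

```python
def audit_workloads(workloads: "dict[str, dict]",
                    deprecated_classes: "set[str]",
                    label_key: str = "gpu.nvidia.com/class") -> "dict[str, str]":
    """Return {workload_name: deprecated_class} for every workload
    whose node selector still targets a deprecated GPU class."""
    findings = {}
    for name, spec in workloads.items():
        gpu_class = spec.get("nodeSelector", {}).get(label_key)
        if gpu_class in deprecated_classes:
            findings[name] = gpu_class
    return findings
```

Running such an audit before a platform sunset date yields the list of deployments that need migration, rather than discovering them through scheduling failures afterward.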