About
This skill provides a framework for sharing limited GPU VRAM across concurrent services such as Ollama, Whisper, and ComfyUI. It covers patterns for catching out-of-memory (OOM) errors, retrying with timed backoff, and configuring aggressive auto-unload of idle models. A lightweight signaling protocol lets services request memory from one another, so disparate AI applications can share the same hardware in local or server-side workflows without a centralized orchestrator.
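The OOM-catch-and-retry pattern described above can be sketched as a small helper. This is a minimal illustration, not the skill's actual implementation: `OutOfMemoryError` here is a stand-in for whatever your backend raises (e.g. PyTorch's `torch.cuda.OutOfMemoryError`), and the `retry_on_oom` name and parameters are hypothetical.

```python
import time


class OutOfMemoryError(RuntimeError):
    """Stand-in for a backend OOM error (e.g. torch.cuda.OutOfMemoryError)."""


def retry_on_oom(fn, attempts=3, base_delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff when it raises OOM.

    The delay between attempts gives idle services time to auto-unload
    their models and release VRAM before we try again.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except OutOfMemoryError:
            if attempt == attempts:
                raise  # out of retries: surface the OOM to the caller
            sleep(delay)  # wait for another service to free memory
            delay *= backoff


# Example: a call that fails twice with OOM, then succeeds once VRAM frees up.
calls = {"n": 0}

def flaky_inference():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OutOfMemoryError("CUDA out of memory")
    return "ok"

waits = []  # capture the sleep durations instead of actually sleeping
result = retry_on_oom(flaky_inference, attempts=5, base_delay=0.5, sleep=waits.append)
```

Injecting `sleep` as a parameter keeps the helper testable; in production you would leave it as `time.sleep` and tune `attempts`/`base_delay` to how quickly your idle-unload timeouts fire.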