Overview
The GPU monitoring skill provides specialized tools for tracking and optimizing hardware performance when running Ollama inference. It lets developers monitor NVIDIA and AMD GPU status, VRAM usage, and key inference metrics such as tokens per second and evaluation duration. With automated health checks and troubleshooting guides for common issues such as CPU fallback and Out-of-Memory (OOM) errors, the skill helps ensure that local AI models run at peak efficiency on the available hardware.
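As a minimal sketch of the throughput metric mentioned above: Ollama's `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), from which tokens per second can be derived. The helper function name below is an illustration, not part of the skill's actual API.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Derive tokens/s from Ollama's eval metrics.

    Ollama reports eval_duration in nanoseconds, so we scale by 1e9.
    """
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / eval_duration_ns * 1e9

# Example: 128 tokens generated in 2 seconds -> 64 tokens/s.
print(tokens_per_second(128, 2_000_000_000))
```

In practice these fields come from the JSON body of a non-streaming `/api/generate` call; a sustained drop in this number is often the first sign that inference has fallen back to the CPU.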