Overview
The Ollama GPU Monitor skill provides specialized tools for managing hardware performance during local AI model execution. It lets users track NVIDIA and AMD GPU status, monitor real-time VRAM consumption, verify which models are loaded via the Ollama API, and compute inference performance metrics such as tokens per second. This skill is aimed at developers debugging slow inference, managing resource constraints, or benchmarking different LLMs within a Bazzite AI environment.
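As a minimal sketch of the tokens-per-second calculation: Ollama's `/api/generate` responses include an `eval_count` field (tokens generated) and an `eval_duration` field (generation time in nanoseconds), so the rate is the ratio of the two after converting to seconds. The helper name below is illustrative, not part of the skill's actual interface.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's raw response counters into a tokens/second rate.

    eval_count       -- tokens generated, from the API response
    eval_duration_ns -- generation time in nanoseconds, from the API response
    """
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be a positive nanosecond count")
    # Ollama reports durations in nanoseconds; normalize to seconds.
    return eval_count / (eval_duration_ns / 1_000_000_000)
```

For example, a response reporting 256 generated tokens over 4 seconds of eval time yields a rate of 64 tokens per second. The same two fields appear in the final (non-streaming) JSON object of a completed generation request.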