Can it fix the training code automatically?

No. For safety, the skill runs its analysis in read-only mode. It provides specific, evidence-backed recommendations and code templates that you review and apply yourself.

What types of ML failures can this skill diagnose?

The skill can identify a wide range of issues including loss divergence, mode collapse, exploding/vanishing gradients, architecture imbalances, and optimization failures.

Does this skill require specific log formats?

No. It works with standard training logs, loss CSVs, and configuration files, and uses scripts to extract key metrics across different frameworks.
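As an illustration of the kind of metric extraction a loss CSV allows, here is a minimal sketch in Python. It is not the skill's own script: the function name `find_loss_problem`, the `loss` column name, and the `spike_factor` threshold are all assumptions made for this example, and it assumes a positive loss that should trend downward.

```python
import csv
import io
import math


def find_loss_problem(csv_text, loss_column="loss", spike_factor=10.0):
    """Return (row_index, reason) for the first suspicious loss value, or None.

    Hypothetical helper: the column name and the spike threshold are
    illustrative assumptions, not part of the skill.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    best = math.inf  # lowest loss seen so far
    for i, row in enumerate(reader):
        loss = float(row[loss_column])  # float() also parses "nan" and "inf"
        if not math.isfinite(loss):
            return (i, "non-finite loss")
        # Flag a jump well above the best loss seen so far.
        if best < math.inf and loss > spike_factor * best:
            return (i, "loss spike")
        best = min(best, loss)
    return None


if __name__ == "__main__":
    log = "step,loss\n0,2.0\n1,1.5\n2,1.2\n3,40.0\n"
    print(find_loss_problem(log))
```

Comparing each value against the best loss seen so far, rather than the previous step, keeps the check from being fooled by a slow ramp-up before the blow-up.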

Which ML frameworks are supported?

Because the skill uses a specialist agent to analyze code and artifacts, it is framework-agnostic and works with PyTorch, TensorFlow, JAX, and other popular libraries.

ML Training Debugger

Name: ML Training Debugger
Author: DNYoussef

Data Science & ML

Diagnoses machine learning training failures such as loss divergence and gradient issues through automated artifact analysis, and recommends fixes.

The ML Training Debugger is a specialized skill for Claude Code designed to automate the diagnostic process for problematic machine learning training runs. It spawns a specialist agent to systematically analyze training logs, loss curves, model architecture, and gradient statistics to pinpoint the root causes of issues such as exploding gradients, mode collapse, or architecture imbalances. By providing evidence-based recommendations and actionable fixes, this skill helps researchers and engineers stabilize training pipelines and reduce the time spent on manual trial-and-error debugging.

Key Features

01 Automated analysis of loss curves, training logs, and model checkpoints

02 Evidence-based architectural and hyperparameter optimization recommendations

03 1 GitHub star

04 Systematic root cause identification for mode collapse and loss divergence

05 Detection of gradient issues including exploding, vanishing, or stagnant gradients

06 Built-in tools for GPU memory profiling and parameter distribution analysis
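
To make the gradient categories above concrete, here is a rough Python sketch that labels a non-empty series of per-step global gradient norms. The function name `classify_grad_norms` and every threshold in it are hypothetical choices for illustration. The listing does not say how the skill itself draws these lines, and sensible cutoffs depend on the model and optimizer.

```python
import math


def classify_grad_norms(norms, explode_at=1e3, vanish_at=1e-7, stagnant_rel=1e-3):
    """Label a non-empty list of per-step gradient norms.

    All thresholds are illustrative assumptions, not values used by the skill.
    """
    if any(not math.isfinite(n) for n in norms):
        return "non-finite"
    if max(norms) > explode_at:
        return "exploding"
    if norms[-1] < vanish_at:
        return "vanishing"
    mean = sum(norms) / len(norms)
    spread = max(norms) - min(norms)
    # Norms that barely move relative to their mean suggest a stalled optimizer.
    if len(norms) > 1 and spread <= stagnant_rel * mean:
        return "stagnant"
    return "ok"


if __name__ == "__main__":
    print(classify_grad_norms([1.0, 5.0, 5000.0]))
    print(classify_grad_norms([1e-3, 1e-6, 1e-9]))
    print(classify_grad_norms([0.5, 0.5, 0.5]))
```

The non-finite check runs first because NaN values make `max` and the comparisons that follow unreliable.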

Use Cases

01 Troubleshooting sudden loss divergence or NaN values during deep learning training

02 Diagnosing degenerate outputs where models produce repetitive or single-token results

03 Identifying bottlenecked architectures where embedding layers or parameters are imbalanced
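
For the degenerate-output case, one simple signal is how much of a generated sequence is taken up by a single token. The Python sketch below is an assumption-laden illustration, not the skill's method: `repetition_stats`, `looks_degenerate`, and the `max_top_share` cutoff are hypothetical names and values, and the input is assumed to be a non-empty list of already-tokenized output.

```python
from collections import Counter


def repetition_stats(tokens):
    """Summarize how repetitive a non-empty token sequence is."""
    counts = Counter(tokens)
    top_token, top_count = counts.most_common(1)[0]
    return {
        "distinct_ratio": len(counts) / len(tokens),  # unique tokens / total tokens
        "top_token": top_token,
        "top_share": top_count / len(tokens),  # fraction taken by the most common token
    }


def looks_degenerate(tokens, max_top_share=0.5):
    """Flag output dominated by one token; the 0.5 cutoff is an illustrative assumption."""
    return repetition_stats(tokens)["top_share"] > max_top_share


if __name__ == "__main__":
    sample = ["the"] * 9 + ["cat"]
    print(repetition_stats(sample))
    print(looks_degenerate(sample))
```

A check like this only catches single-token collapse. Repeated phrases would need an n-gram version of the same count.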

Install with 🐟 Skill.Fish

npx skillfish add dnyoussef/ai-chrome-extension ml-training-debugger
