Self-Healing Infrastructure for AI: Auto-Recovery & GPU