Troubleshoots and resolves stalled or hanging AI evaluations within the METR Hawk and Inspect AI ecosystem.
The debug-stuck-eval skill provides a specialized diagnostic toolkit for developers and researchers running AI evaluations on UK AISI's Inspect platform. It automates the process of identifying why an evaluation is frozen or failing, offering guided steps to verify authentication, analyze cloud logs for error patterns such as OOM kills or API timeouts, and test connectivity through proxy services like Middleman. By interpreting retry loops and pod statuses, it helps users recover stuck evaluation sets and resume from S3 sample buffers so that completed samples are not lost.
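
A quick first check before deeper triage is whether the eval is genuinely stuck or just slow. Below is a minimal sketch of that check, assuming local access to the Inspect log file; the path and five-minute interval are placeholders, not values prescribed by the skill:

```python
# Crude liveness check: if the Inspect log file is still growing, samples are
# still being written. A sketch only; the log path below is hypothetical.
import time
from pathlib import Path

LOG_FILE = Path("logs/2024-06-01T12-00-00_my-task_abc123.eval")  # placeholder

size_before = LOG_FILE.stat().st_size
time.sleep(300)  # re-check after five minutes
size_after = LOG_FILE.stat().st_size

if size_after > size_before:
    print("Log is still growing; the eval is likely progressing, not stuck.")
else:
    print("Log has not grown in five minutes; check pod status and API logs.")
```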
Key Features
1. Real-time status tracking for eval sets and individual sample pod completion
2. Recognition of specific Inspect AI error patterns and OpenAI SDK retry behaviors
3. Direct API connectivity testing via Middleman and provider-specific endpoints
4. Automated log analysis for identifying OOMKilled pods, 500 errors, and retry loops (see the pod-triage sketch after this list)
5. Guided recovery workflows, including eval restarts with sample buffer resumption
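
As an illustration of the pod triage automated by feature 4, the sketch below lists pods and flags any container that Kubernetes terminated with OOMKilled. It assumes the official `kubernetes` Python client, a working kubeconfig, and a hypothetical `inspect-evals` namespace:

```python
# Flag OOMKilled containers in the eval namespace. A sketch, not the skill's
# actual implementation; namespace and access method are assumptions.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "inspect-evals"  # placeholder; substitute your eval set's namespace

for pod in v1.list_namespaced_pod(NAMESPACE).items:
    for cs in pod.status.container_statuses or []:
        # A container killed for exceeding its memory limit reports
        # reason "OOMKilled" in its current or last terminated state.
        terminated = cs.state.terminated or cs.last_state.terminated
        if terminated and terminated.reason == "OOMKilled":
            print(f"{pod.metadata.name}/{cs.name}: OOMKilled "
                  f"(restarts={cs.restart_count})")
```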
Use Cases
1. Verifying whether a 'stuck' evaluation is actually progressing via alternating fail-ok patterns (see the log-scanning sketch after this list)
2. Troubleshooting 500 Internal Server errors and rate limits during large-scale model evals
3. Diagnosing why an AI evaluation set has stopped progressing or appears frozen
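
For use cases 1 and 2, a rough way to distinguish a healthy retry loop from a genuine stall is to count error and retry signatures in a captured pod log. The sketch below assumes the log has been saved locally; the file name and regex patterns are illustrative, not the exact strings every provider or SDK version emits:

```python
# Scan a saved pod log for the signatures behind an "alternating fail-ok"
# pattern: repeated 500s or rate limits followed by SDK retry notices.
import re
from collections import Counter
from pathlib import Path

log_text = Path("pod.log").read_text()  # e.g. captured via `kubectl logs <pod>`

patterns = {
    "http_500": re.compile(r"500 Internal Server Error", re.IGNORECASE),
    "rate_limit": re.compile(r"429|rate limit", re.IGNORECASE),
    "sdk_retry": re.compile(r"retrying request", re.IGNORECASE),
}

counts = Counter({name: len(p.findall(log_text)) for name, p in patterns.items()})
print(counts)

# Many retries with no terminal error usually means the eval is slow rather
# than stuck: the SDK is backing off and recovering, so samples still complete.
```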