Diagnoses and resolves stalled or failing AI model evaluations within the METR Hawk and UK AISI Inspect frameworks.
The debug-stuck-eval skill provides a specialized diagnostic framework for troubleshooting AI model evaluations that have hung, timed out, or encountered persistent errors. It enables Claude to guide users through verifying authentication, checking pod status, and interpreting complex log patterns like retry loops or OOMKilled events. By combining low-level log analysis with direct API connectivity testing through middleman proxies, this skill helps identify whether bottlenecks are caused by token limits, infrastructure failures, or provider instability, ensuring that researchers can efficiently resume and complete critical model safety evaluations.
Key Features
1. Automated identification of common Inspect AI error patterns and retry logs
2. Advanced log parsing using the inspect_ai Python library (see the log-parsing sketch after this list)
3. Step-by-step Hawk infrastructure verification and pod status reporting (see the pod-status sketch below)
4. Evaluation recovery workflows using S3 buffer resumes
5. Direct API connectivity testing via middleman auth proxy scripts
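To see where a run stalled, the eval log itself is usually the fastest signal. Below is a minimal sketch using inspect_ai's log API (read_eval_log); the log path is illustrative, and the status/error/samples fields follow inspect_ai's EvalLog model.

```python
from inspect_ai.log import read_eval_log

LOG = "logs/2024-06-01T12-00-00_my-task.eval"  # illustrative path

# header_only=True skips per-sample bodies, which keeps this fast on large logs.
log = read_eval_log(LOG, header_only=True)
print(f"status: {log.status}")  # "started" | "success" | "cancelled" | "error"
if log.status == "error" and log.error:
    print(f"error: {log.error.message}")

# For a hang, re-read with samples to see where progress stopped.
full = read_eval_log(LOG)
print(f"samples recorded: {len(full.samples or [])}")
```

A log still in "started" status long after the run should have finished is the classic signature of a hung evaluation.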
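OOMKilled terminations and crash-looping pods show up in Kubernetes pod status. The sketch below shells out to kubectl; the hawk-evals namespace is a placeholder for whatever namespace your Hawk deployment uses, while the JSON paths are standard Kubernetes pod-status fields.

```python
import json
import subprocess

# "hawk-evals" is a placeholder namespace; substitute your deployment's.
result = subprocess.run(
    ["kubectl", "get", "pods", "-n", "hawk-evals", "-o", "json"],
    capture_output=True, text=True, check=True,
)

for pod in json.loads(result.stdout)["items"]:
    name = pod["metadata"]["name"]
    print(f"{name}: {pod['status'].get('phase', 'Unknown')}")
    # An OOMKilled container is recorded in its last terminated state.
    for cs in pod["status"].get("containerStatuses", []):
        terminated = (cs.get("lastState") or {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            print(f"  container {cs['name']} was OOMKilled")
```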
Use Cases
1. Debugging 500 Internal Server errors and 400 token-limit issues in model requests (a connectivity probe sketch follows this list)
2. Recovering and restarting stalled evaluation sets without losing progress (see the retry sketch below)
3. Troubleshooting evaluations that are frozen or hanging at a specific sample count
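When model requests fail with 500s, probing the provider directly through the middleman proxy separates provider instability from harness bugs. Everything in this sketch is an assumption about the deployment: the endpoint URL, the MIDDLEMAN_TOKEN environment variable, and the OpenAI-style payload are placeholders to adapt to your proxy's actual API.

```python
import os
import requests

# Placeholder endpoint and credentials; adjust to your middleman deployment.
URL = os.environ.get(
    "MIDDLEMAN_URL", "https://middleman.example.internal/v1/chat/completions"
)
TOKEN = os.environ["MIDDLEMAN_TOKEN"]

resp = requests.post(
    URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "model": "gpt-4o",  # placeholder model name
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 5,
    },
    timeout=30,
)
# 200 -> proxy and provider reachable; 401/403 -> auth; 5xx -> provider/proxy.
print(resp.status_code, resp.text[:300])
```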
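For resuming a stalled set without redoing finished work, Inspect provides a retry entry point that reuses completed samples from the existing log (CLI equivalent: `inspect eval-retry <log-file>`). A minimal sketch; whether Hawk's S3 sample buffer is picked up automatically is deployment-specific and not shown here.

```python
from inspect_ai import eval_retry

# Retry the failed/cancelled run recorded in this log; completed samples
# are reused rather than re-executed. The path is illustrative.
logs = eval_retry("logs/2024-06-01T12-00-00_my-task.eval")
print(logs[0].status)
```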