Diagnoses and resolves stalled or failing AI evaluations running on METR's Hawk cloud platform with the UK AISI Inspect AI framework.
The Debug Stuck Eval skill provides a structured diagnostic framework for troubleshooting AI evaluations that have frozen, timed out, or encountered errors within the METR Hawk and UK AISI Inspect AI ecosystem. It enables Claude to analyze evaluation status reports, parse complex log patterns for specific failure modes like OOMKilled or API rate limits, and perform direct connectivity tests via the Middleman proxy. By guiding the user through authentication checks, S3 buffer management, and recovery commands, this skill significantly reduces the time spent debugging infrastructure issues during large-scale model benchmarking and safety evaluations.
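To make the log-analysis step concrete, here is a minimal sketch of how lines from an evaluation log might be bucketed into the failure modes named above. The pattern table and `classify_log` helper are illustrative assumptions, not the skill's actual implementation; the exact strings Hawk and Inspect emit may differ.

```python
import re
import sys
from collections import Counter

# Hypothetical signatures for the failure modes discussed above.
# Adjust the regexes to match the log format of your deployment.
FAILURE_PATTERNS = {
    "oom_killed": re.compile(r"OOMKilled"),
    "rate_limited": re.compile(r"rate ?limit|\b429\b", re.IGNORECASE),
    "server_error": re.compile(r"internal server error|\b500\b", re.IGNORECASE),
    "timeout": re.compile(r"timed? ?out|deadline exceeded", re.IGNORECASE),
}

def classify_log(lines):
    """Count occurrences of each known failure signature in a log stream."""
    counts = Counter()
    for line in lines:
        for label, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for label, n in classify_log(f).most_common():
            print(f"{label}: {n}")
```

A dominant `rate_limited` count points toward backing off concurrency, while any `oom_killed` hit shifts the investigation to pod resources rather than the model API.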
Key Features
1. Sample buffer management to enable evaluation resumption without data loss
2. Detection of common error patterns including 500 errors, rate limits, and OOMKilled pods
3. Structured recovery workflows for deleting and restarting hanging evaluations
4. Automated Hawk status and log analysis for specific evaluation sets
5. API connectivity testing through the Middleman proxy and direct provider endpoints (see the sketch after this list)
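As a rough illustration of the proxy-versus-direct connectivity check, the sketch below sends the same minimal chat completion through a Middleman-style base URL and straight to the provider. The URLs, environment variable names, and model name are assumptions for illustration, not the skill's actual configuration.

```python
import os
import requests

# Both base URLs are placeholders; substitute your deployment's
# Middleman address and the real provider endpoint.
ENDPOINTS = {
    "middleman_proxy": os.environ.get(
        "MIDDLEMAN_BASE_URL", "https://middleman.example.internal/v1"
    ),
    "direct_provider": "https://api.openai.com/v1",
}

PAYLOAD = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 1,
}

def probe(name: str, base_url: str, api_key: str) -> None:
    """Send one tiny chat completion and report the status, to separate
    proxy-side failures (auth, 5xx) from provider-side ones."""
    try:
        r = requests.post(
            f"{base_url}/chat/completions",
            json=PAYLOAD,
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        print(f"{name}: HTTP {r.status_code} -- {r.text[:120]}")
    except requests.RequestException as exc:
        print(f"{name}: request failed -- {exc}")

if __name__ == "__main__":
    for name, base_url in ENDPOINTS.items():
        probe(name, base_url, os.environ.get("API_KEY", ""))
```

Running both probes back to back is what makes the comparison useful: a failure that appears only on the proxy path implicates Middleman, while matching failures on both paths implicate the provider.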
Use Cases
1. Investigating evaluations that are frozen or showing no progress in sample completion
2. Debugging high retry counts and 'Internal server error' messages in model API calls
3. Verifying whether evaluation failures are caused by proxy auth issues or provider downtime (a triage sketch follows this list)
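As a companion to the connectivity probe above, here is one way to fold the two status codes into a verdict for the third use case. This decision table is an assumption based on common HTTP semantics (401/403 for auth, 429 for rate limits, 5xx for provider trouble), not the skill's documented logic.

```python
from typing import Optional

def triage(proxy_status: Optional[int], direct_status: Optional[int]) -> str:
    """Heuristic verdict from the two probe results.
    None means the request never completed (network failure or timeout)."""
    if proxy_status in (401, 403) and direct_status == 200:
        return "proxy auth issue: Middleman rejected credentials the provider accepts"
    if proxy_status == 200 and direct_status == 200:
        return "connectivity fine: look elsewhere (e.g. OOMKilled pods, stuck samples)"
    if proxy_status == 429 or direct_status == 429:
        return "rate limited: back off and lower eval concurrency"
    if (proxy_status or 0) >= 500 and (direct_status or 0) >= 500:
        return "provider downtime: both paths return 5xx"
    if proxy_status is None and direct_status == 200:
        return "proxy unreachable while provider is up: check the Middleman deployment"
    return "ambiguous: capture full logs before deleting or restarting the eval"

# Example: the proxy returns 403 while the direct call succeeds.
print(triage(403, 200))
```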