Diagnoses and resolves hanging, stalled, or failing AI evaluations within the Hawk and Inspect AI frameworks.
This skill provides specialized diagnostics for troubleshooting UK AISI's Inspect AI evaluations that have stalled or encountered errors. It streamlines debugging by guiding users through authentication checks, pod status monitoring, log analysis for specific error patterns (such as 500/400 status codes or OOMKilled states), and direct API testing. By identifying whether a bottleneck stems from rate limits, context window exhaustion, or infrastructure failures, it helps evaluations resume efficiently from S3-backed buffers using the proper recovery commands.
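The log-analysis step described above amounts to classifying error lines into the failure categories the skill distinguishes. A minimal sketch of that triage, where the regex patterns and category names are illustrative assumptions rather than the skill's actual rules:

```python
import re

# Hypothetical triage table: maps a raw log line to a failure class.
# Patterns are illustrative assumptions, not the skill's real matchers.
PATTERNS = [
    ("rate_limit", re.compile(r"\b429\b|rate limit", re.I)),
    ("client_error", re.compile(r"\b4\d\d\b")),
    ("server_error", re.compile(r"\b5\d\d\b|internal error", re.I)),
    ("oom", re.compile(r"OOMKilled")),
    ("context_window", re.compile(r"context (window|length)|maximum context", re.I)),
]

def classify(line: str) -> str:
    """Return the first matching failure category, or 'unknown'."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return "unknown"
```

Checking rate limits before generic 4xx codes matters, since a 429 is recoverable by backing off while other client errors usually indicate a misconfigured request.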
Key Features
1. Automated identification of error patterns such as 429 rate limits and 500 internal errors
2. Direct API connectivity testing via Middleman and provider proxies
3. Recovery workflows for resuming evaluations from S3 buffers
4. Real-time log streaming and sample completion tracking via the Hawk CLI
5. Comprehensive evaluation status and pod health monitoring
Use Cases
1. Troubleshooting why an evaluation set has stopped progressing at a specific sample
2. Investigating frequent 500 errors or retry loops in model response calls
3. Determining whether infrastructure issues (e.g. OOMKilled pods) or API limits are causing eval timeouts
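Spotting the retry loops mentioned in the use cases above can be done by counting retries per sample in the streamed logs. A minimal sketch, assuming a hypothetical log format of the form `sample=<id> ... retrying` (the real framework's log layout may differ):

```python
import re
from collections import Counter

# Assumed log line shape for illustration: "sample=<id> ... retrying (n/m)".
RETRY_RE = re.compile(r"sample=(\S+).*retry", re.I)

def stalled_samples(log_lines, threshold=3):
    """Return sample ids that retried at least `threshold` times."""
    counts = Counter()
    for line in log_lines:
        match = RETRY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(s for s, n in counts.items() if n >= threshold)
```

Samples flagged this way are the natural starting point for the deeper checks above: their pod state distinguishes OOMKilled infrastructure failures from API-side rate limiting.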