Diagnoses and resolves stalled, hanging, or failing AI evaluations within the Hawk and Inspect AI frameworks.
The Inspect AI Evaluation Debugger is a specialized diagnostic skill designed to troubleshoot hung or failing model evaluations on the METR Hawk platform. It provides a structured workflow to verify authentication, monitor pod states, analyze logs for error patterns like 'OOMKilled' or 'RateLimitError,' and perform direct API connectivity tests through middleman proxies. By offering specific recovery steps and techniques for accessing S3-backed sample buffers, this skill helps developers and researchers minimize downtime and ensure the successful completion of complex AI benchmarking tasks.
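As a rough illustration of the log-analysis step, the sketch below scans a captured pod log for the failure signatures mentioned above. The log path and the signature list are assumptions for the example, not part of the skill itself.

```python
import re
from collections import Counter
from pathlib import Path

# Illustrative failure signatures; the skill's own pattern set may differ.
SIGNATURES = {
    "OOMKilled": r"OOMKilled",
    "RateLimitError": r"RateLimitError",
    "HTTP 500": r"500 Internal Server Error",
    "Token limit": r"maximum context length|token limit",
}

def summarize_failures(log_path: str) -> Counter:
    """Count occurrences of each known failure signature in a pod log dump."""
    counts = Counter()
    text = Path(log_path).read_text(errors="replace")
    for name, pattern in SIGNATURES.items():
        counts[name] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

if __name__ == "__main__":
    # Hypothetical log file, e.g. captured with `kubectl logs <pod> > eval-pod.log`.
    print(summarize_failures("eval-pod.log"))
```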
Key Features
1. Resource utilization tracking to detect memory exhaustion and OOM kills
2. Recovery workflows for restarting stuck evaluations using S3-backed sample buffers (see the sketch after this list)
3. Real-time evaluation status reporting and pod state monitoring
4. Direct API connectivity testing via Middleman and provider-specific curl commands
5. Automated log analysis to identify API retry patterns and error codes
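For the recovery workflow, the buffered samples live in S3. The sketch below lists and downloads buffer objects with boto3; the bucket name and key prefix are hypothetical placeholders, since the real buffer location depends on how the Hawk deployment is configured.

```python
import boto3

# Hypothetical bucket and prefix; the actual sample-buffer location is
# deployment-specific.
BUCKET = "hawk-eval-sample-buffers"
PREFIX = "runs/eval-123/"

def list_buffered_samples(bucket: str = BUCKET, prefix: str = PREFIX) -> list[str]:
    """List sample-buffer objects so a stuck evaluation can be resumed from them."""
    s3 = boto3.client("s3")
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

def download_sample(key: str, dest: str, bucket: str = BUCKET) -> None:
    """Copy one buffered sample locally for inspection before restarting the run."""
    boto3.client("s3").download_file(bucket, key, dest)
```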
Use Cases
1. Troubleshooting evaluations that are frozen or stuck in an infinite retry loop
2. Diagnosing HTTP 500 Internal Server Error responses or token-limit exceptions during high-volume runs
3. Validating connectivity and authentication between model providers and the Hawk environment (see the sketch below)
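A minimal connectivity and authentication check, sketched in Python rather than the skill's provider-specific curl commands. The proxy URL, model endpoint, and API-key environment variable are placeholders; the real values come from the Hawk environment.

```python
import os
import requests

# Placeholder endpoints; the actual middleman proxy URL and provider endpoint
# are supplied by the deployment.
PROXY_URL = "http://middleman.internal:8080"
MODEL_ENDPOINT = "https://api.openai.com/v1/models"

def check_connectivity(timeout: float = 10.0) -> None:
    """Send a minimal authenticated request through the proxy and report the result."""
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    proxies = {"http": PROXY_URL, "https": PROXY_URL}
    resp = requests.get(MODEL_ENDPOINT, headers=headers, proxies=proxies, timeout=timeout)
    if resp.status_code == 200:
        print("Provider reachable; authentication accepted.")
    elif resp.status_code in (401, 403):
        print("Reached the provider, but the API key was rejected.")
    else:
        print(f"Unexpected status {resp.status_code}: {resp.text[:200]}")

if __name__ == "__main__":
    check_connectivity()
```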