Diagnoses and resolves stalled or failing Inspect AI evaluations running on the METR Hawk platform.
The Hawk Evaluation Debugger is a specialized skill for troubleshooting UK AISI's Inspect AI evaluations when they hang, time out, or return persistent errors in the cloud. It provides a structured workflow for verifying authentication, monitoring pod health, and analyzing logs for specific failure signatures such as API retries, OOM errors, and proxy issues. By enabling direct API testing through the middleman proxy and providing recovery commands for S3 sample buffers, the skill helps researchers minimize downtime and ensure that large-scale LLM evaluation sets run to completion.
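For the "direct API testing through the middleman proxy" step, a quick connectivity probe usually separates authentication problems from proxy or upstream provider faults. The sketch below assumes an OpenAI-compatible proxy endpoint; the base URL, environment-variable names, and model name are placeholders, not the actual Hawk or middleman configuration.

```python
import os
import requests

# Placeholders -- substitute the values used by your deployment.
PROXY_BASE_URL = os.environ.get("MIDDLEMAN_BASE_URL", "https://middleman.example.internal/v1")
API_KEY = os.environ["MIDDLEMAN_API_KEY"]  # hypothetical variable name

# One tiny chat completion is enough to exercise the full request path.
resp = requests.post(
    f"{PROXY_BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "gpt-4o-mini",  # any model the proxy routes
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 5,
    },
    timeout=30,
)

# 200 = healthy end to end; 401/403 = auth; 429 = rate limiting; 5xx = proxy or provider fault.
print(resp.status_code)
print(resp.text[:500])
```

The same request can be issued with curl if Python is not available; the status code carries the same diagnostic meaning either way.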
Key Features
1. Sample buffer management to resume evaluations without data loss
2. Real-time status monitoring for Hawk evaluation sets and pod states
3. Streamlined log retrieval and follow mode for active debugging
4. Direct middleman proxy and provider API connectivity testing via curl
5. Automated error pattern recognition for API retries, 500 errors, and OOM issues (see the log-scan sketch after this list)
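The error-pattern recognition feature amounts to scanning retrieved pod logs for a handful of failure signatures. A minimal sketch of that kind of scan follows; the regular expressions are illustrative guesses to adjust against the exact wording in your evaluation logs, and piping logs in via stdin is an assumption about how they are retrieved.

```python
import re
import sys
from collections import Counter

# Illustrative failure signatures -- adjust to match your logs' exact wording.
PATTERNS = {
    "api_retry": re.compile(r"retry(ing)? request|rate limit|\b429\b", re.IGNORECASE),
    "server_error": re.compile(r"500 Internal Server Error|\b50[234]\b", re.IGNORECASE),
    "oom": re.compile(r"OOMKilled|out of memory|MemoryError", re.IGNORECASE),
    "proxy": re.compile(r"proxy error|connection reset|timed out", re.IGNORECASE),
}

counts = Counter()
for line in sys.stdin:
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            counts[name] += 1

# Report the most frequent signatures first.
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

Piped usage might look like `kubectl logs <pod-name> | python scan_logs.py`, with the pod name taken from the status-monitoring step.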
Use Cases
1. Investigating 500 Internal Server Errors and 429 Rate Limit responses from the middleman proxy
2. Diagnosing why an evaluation set is frozen or not progressing past a specific sample count
3. Recovering and restarting failed evaluations using S3 buffer persistence (see the sketch after this list)
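Before restarting a failed evaluation from its persisted sample buffer, it helps to confirm that the buffered objects actually exist in S3. Below is a minimal sketch assuming boto3 credentials are already configured; the bucket name and key prefix are hypothetical placeholders for wherever your deployment writes its buffers, not a documented Hawk convention.

```python
import boto3

# Placeholder bucket and prefix -- use the buffer location configured for your evaluation set.
BUCKET = "hawk-eval-buffers"
PREFIX = "eval-sets/my-eval-set-id/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total = 0
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        total += 1
        print(f'{obj["LastModified"]}  {obj["Size"]:>10}  {obj["Key"]}')

print(f"{total} buffered objects found under s3://{BUCKET}/{PREFIX}")
```

A non-zero object count with recent timestamps suggests the buffer survived the failure and the evaluation set can be resumed rather than rerun from scratch.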