Troubleshoots and resolves hung or failing UK AISI Inspect AI evaluations running in the Hawk cloud environment.
The Inspect AI Evaluation Debugger is a specialized skill designed to monitor, diagnose, and recover stuck model evaluations within the METR Hawk ecosystem. It streamlines the debugging process by identifying common failure patterns such as API rate limits, memory exhaustion (OOMKilled), and authentication errors. By providing structured workflows for log analysis, direct API testing via middleman proxies, and S3-backed buffer management, it helps large-scale AI benchmarking runs resume efficiently and reach completion.
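As a rough sketch of what pod-state monitoring can look like, the snippet below lists evaluation pods and surfaces terminated-container reasons such as OOMKilled. It assumes kubeconfig access to the Hawk cluster and uses the official Kubernetes Python client; the namespace, label selector, and function name are illustrative placeholders, not actual Hawk values.

```python
# Hypothetical sketch: report pod state for evaluation runs.
# Namespace and label selector are placeholders for the real Hawk configuration.
from kubernetes import client, config


def report_eval_pods(namespace: str = "hawk-evals", selector: str = "app=inspect-eval") -> None:
    # Use the local kubeconfig; inside the cluster, load_incluster_config() would apply instead.
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=selector)
    for pod in pods.items:
        # Collect terminated-container reasons (e.g. OOMKilled, Error) if any are present.
        reasons = [
            cs.state.terminated.reason
            for cs in (pod.status.container_statuses or [])
            if cs.state and cs.state.terminated
        ]
        print(f"{pod.metadata.name}: phase={pod.status.phase} reasons={reasons or 'none'}")
```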
Key Features
1. Real-time evaluation status tracking and pod state monitoring
2. Direct API connectivity testing through middleman authentication proxies
3. Log streaming and sample completion status analysis
4. Recovery workflows for resuming evaluations from S3-stored buffers
5. Automated error pattern recognition for 400, 500, and OOM errors (see the sketch after this list)
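One way error-pattern recognition along these lines could be implemented is with a small set of regular expressions applied to streamed pod or runner logs. The labels and patterns below are illustrative examples, not the skill's actual rules.

```python
# Hypothetical sketch: classify common failure signatures found in log lines.
import re
from collections import Counter

PATTERNS = {
    "rate_limit": re.compile(r"\b429\b|rate limit", re.IGNORECASE),
    "client_400": re.compile(r"\b400\b|bad request", re.IGNORECASE),
    "server_500": re.compile(r"\b500\b|internal server error", re.IGNORECASE),
    "oom":        re.compile(r"OOMKilled|out of memory", re.IGNORECASE),
    "auth":       re.compile(r"\b401\b|\b403\b|unauthorized|forbidden", re.IGNORECASE),
}


def classify_log(lines: list[str]) -> Counter:
    # Count how many log lines match each failure category.
    counts: Counter = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```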
Use Cases
1. Troubleshooting transient 500 Internal Server Error responses and API timeouts
2. Identifying the root cause of AI model evaluations that have stopped progressing
3. Verifying authentication and connectivity between Hawk runners and model providers (see the sketch after this list)
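To verify authentication and connectivity end to end, a single low-cost request can be sent through the middleman proxy. The sketch below assumes an OpenAI-compatible chat-completions endpoint; the base URL, environment variable, and model name are placeholders for whatever the Hawk runner is actually configured to use.

```python
# Hypothetical sketch: one-token request through the middleman proxy to check
# credentials and connectivity. Base URL, env var, and model are placeholders.
import os

import requests


def check_provider(base_url: str = "https://middleman.example.internal/v1") -> None:
    resp = requests.post(
        f"{base_url}/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['MIDDLEMAN_API_KEY']}"},
        json={
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "ping"}],
            "max_tokens": 1,
        },
        timeout=30,
    )
    print(resp.status_code, resp.text[:200])
    resp.raise_for_status()
```

In general, a 401 or 403 here points at proxy credentials, while repeated 500s or timeouts suggest a provider-side or proxy-side issue rather than a problem with the evaluation itself.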