Diagnoses and resolves stalled or failing Inspect AI evaluations by analyzing logs, pod states, and API connectivity within the Hawk cloud environment.
The debug-stuck-eval skill provides a diagnostic toolkit for troubleshooting UK AISI Inspect evaluations. It enables Claude to identify why an evaluation set has frozen or is returning errors by verifying authentication, checking pod statuses via Hawk, and scanning logs for specific failure patterns such as OOMKilled events or 500 Internal Server Errors. By testing API connectivity through the Middleman proxy and managing S3-backed buffers, the skill streamlines recovery from infrastructure hiccups and API instability, so that evaluations can complete with less manual intervention.
Key Features
01 Authentication verification and troubleshooting for Hawk CLI access
02 Smart recovery workflows for restarting evaluations using S3 buffer resumes
03 Connectivity diagnostic tools for Middleman and direct model provider APIs
04 Real-time status tracking for individual samples within an evaluation set
05 Automated log analysis for common error patterns like Pod UID mismatch and OOMKilled
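As a rough illustration of the log-analysis idea, the sketch below counts log lines that match a few known failure patterns. It is not the skill's actual implementation: the pattern names, the regexes, the function `scan_log_lines`, and the sample log lines are all assumptions made up for this example, and real Hawk or Inspect logs may word these errors differently.

```python
import re
from collections import Counter

# Hypothetical pattern table. The error names come from the description above;
# the exact regexes are assumptions about how such errors appear in logs.
FAILURE_PATTERNS = {
    "oom_killed": re.compile(r"OOMKilled"),
    "pod_uid_mismatch": re.compile(r"pod uid mismatch", re.IGNORECASE),
    "http_500": re.compile(r"\b500 Internal Server Error\b"),
}


def scan_log_lines(lines):
    """Count how many log lines match each known failure pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Invented sample lines, for illustration only.
sample_log = [
    "2024-01-01T00:00:00Z runner started",
    "2024-01-01T00:05:00Z container terminated: OOMKilled",
    "2024-01-01T00:06:00Z HTTP 500 Internal Server Error from model API",
    "2024-01-01T00:07:00Z Pod UID mismatch detected",
]

print(dict(scan_log_lines(sample_log)))
```

Keeping the patterns in a single table means a new failure signature can be added without touching the scanning loop.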
Use Cases
01 Troubleshooting evaluations that are hanging or showing no progress in the status dashboard
02 Recovering an evaluation that has crashed due to memory limits or context window overflows
03 Identifying whether an API error is caused by the Middleman proxy or the upstream model provider
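The last use case amounts to comparing the same request sent two ways: once through the proxy and once directly to the provider. A minimal sketch of that decision logic follows. The function `locate_api_failure` and its return labels are hypothetical, and the caller is assumed to have already collected the two HTTP status codes (or `None` when a connection failed) by whatever means are available.

```python
from typing import Optional


def _is_success(status: Optional[int]) -> bool:
    """Treat any 2xx status as success; None means the request never completed."""
    return status is not None and 200 <= status < 300


def locate_api_failure(proxy_status: Optional[int], direct_status: Optional[int]) -> str:
    """Guess where an API failure originates from two probe results.

    proxy_status:  status of the request sent through the proxy
    direct_status: status of the same request sent straight to the provider
    """
    if _is_success(proxy_status):
        return "none"      # the proxied path works, nothing to diagnose
    if _is_success(direct_status):
        return "proxy"     # provider is reachable, so the proxy is the likely culprit
    return "upstream"      # both paths fail, which points at the provider or the network


print(locate_api_failure(500, 200))
print(locate_api_failure(500, 500))
```

The "upstream" label is only a first guess: when both probes fail, a shared network or credential problem would look the same, so it is worth checking those before blaming the provider.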