Diagnoses and resolves hanging or failing AI model evaluations running on the Hawk and Inspect frameworks.
This skill provides a comprehensive toolkit for troubleshooting stuck AI evaluations, specifically targeting the Hawk and UK AISI Inspect AI environments. It lets developers run deep-dive diagnostics by verifying authentication, monitoring pod states, and analyzing logs for specific failure signatures such as OOMKilled terminations, rate limits, or malformed API responses. By guiding users through direct API connectivity testing via the Middleman proxy and managing S3-backed sample buffers, the skill ensures that complex evaluation sets can be recovered or resumed without data loss when a pipeline stalls.
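As one concrete illustration of the pod-state and log checks described above, the sketch below shells out to kubectl to flag OOMKilled containers and scan recent pod logs for rate-limit and server-error signatures. The namespace, label selector, and signature strings are assumptions for illustration, not the skill's actual configuration.

```python
# Minimal sketch: flag OOMKilled pods and scan logs for common failure
# signatures. NAMESPACE, SELECTOR, and SIGNATURES are assumptions, not
# the skill's real configuration.
import json
import subprocess

NAMESPACE = "evals"            # hypothetical namespace
SELECTOR = "app=inspect-eval"  # hypothetical label selector
SIGNATURES = ("429", "rate limit", "500 Internal Server Error")

def report_pod_failures() -> None:
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE, "-l", SELECTOR, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pod in json.loads(out)["items"]:
        name = pod["metadata"]["name"]
        # OOMKilled shows up as the termination reason of a container's last state.
        for status in pod["status"].get("containerStatuses", []):
            terminated = status.get("lastState", {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                print(f"{name}: container {status['name']} was OOMKilled")
        # Scan the tail of the pod's logs for known failure signatures.
        logs = subprocess.run(
            ["kubectl", "logs", name, "-n", NAMESPACE, "--tail=500"],
            capture_output=True, text=True,
        ).stdout
        for sig in SIGNATURES:
            if sig in logs:
                print(f"{name}: found signature {sig!r} in recent logs")

if __name__ == "__main__":
    report_pod_failures()
```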
Key Features
01 Automated identification of 400/500 API errors and resource exhaustion
02 Direct API connectivity testing through Middleman and provider proxies (see the probe sketch after this list)
03 Sample-level tracking and S3 buffer management for evaluation recovery
04 Real-time log streaming and historical analysis for error pattern detection
05 Comprehensive JSON status reporting for evaluation sets and pod health
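A minimal sketch of the connectivity probe named in feature 02 might look like the following. The Middleman base URL, environment variable names, and OpenAI-compatible route are assumptions; the real proxy interface may differ.

```python
# Minimal connectivity probe through a proxy such as Middleman. The base
# URL, env vars, route, and auth scheme below are assumptions for
# illustration only.
import os
import requests

MIDDLEMAN_URL = os.environ.get("MIDDLEMAN_URL", "http://middleman.internal")  # hypothetical
API_KEY = os.environ["MIDDLEMAN_API_KEY"]  # hypothetical env var

def probe(model: str = "gpt-4o") -> None:
    resp = requests.post(
        f"{MIDDLEMAN_URL}/v1/chat/completions",  # assumed OpenAI-compatible route
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": "ping"}],
            "max_tokens": 1,
        },
        timeout=30,
    )
    if resp.status_code == 429:
        print("Rate limited; Retry-After:", resp.headers.get("Retry-After"))
    elif resp.status_code >= 500:
        print("Proxy/provider error:", resp.status_code, resp.text[:200])
    else:
        resp.raise_for_status()
        print("Connectivity OK in", resp.elapsed.total_seconds(), "s")

if __name__ == "__main__":
    probe()
```

Sending a one-token request keeps the probe cheap while still exercising authentication, routing, and rate limiting end to end.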
Use Cases
01 Troubleshooting an evaluation that has stopped progressing or is throwing persistent 500 errors
02 Verifying whether an evaluation delay is caused by rate limits or internal proxy connectivity issues
03 Resuming a failed or stuck evaluation from a checkpoint without losing previous sample progress (see the recovery sketch below)
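To illustrate use case 03, the sketch below assumes completed samples are persisted as individual objects in an S3-backed buffer, so a resumed run can skip them and process only the remainder. The bucket name, key layout, and sample-ID scheme are all hypothetical.

```python
# Minimal sketch of checkpoint recovery from an S3-backed sample buffer.
# Bucket name, key layout, and sample-ID scheme are assumptions.
import boto3

BUCKET = "eval-sample-buffer"  # hypothetical bucket
PREFIX = "run-1234/samples/"   # hypothetical key layout

def completed_samples() -> set[str]:
    """Collect the IDs of samples already written to the buffer."""
    s3 = boto3.client("s3")
    done: set[str] = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            # Assume keys look like run-1234/samples/<sample_id>.json
            done.add(obj["Key"].removeprefix(PREFIX).removesuffix(".json"))
    return done

def remaining(all_ids: list[str]) -> list[str]:
    """Return the sample IDs a resumed run still needs to execute."""
    done = completed_samples()
    return [sid for sid in all_ids if sid not in done]

if __name__ == "__main__":
    print("Samples still to run:", remaining([f"sample-{i}" for i in range(10)]))
```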