Diagnoses and resolves stalled or failing AI evaluations running on METR's Hawk cloud platform with the UK AISI Inspect AI framework.
The Debug Stuck Eval skill provides a structured diagnostic framework for troubleshooting AI evaluations that have frozen, timed out, or encountered errors within the METR Hawk and UK AISI Inspect AI ecosystem. It enables Claude to analyze evaluation status reports, parse complex log patterns for specific failure modes like OOMKilled or API rate limits, and perform direct connectivity tests via the Middleman proxy. By guiding the user through authentication checks, S3 buffer management, and recovery commands, this skill significantly reduces the time spent debugging infrastructure issues during large-scale model benchmarking and safety evaluations.
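To make the log-analysis step concrete, here is a minimal sketch of how lines from an evaluation log might be bucketed into the failure modes named above. The pattern table and `classify_log` helper are illustrative assumptions, not the skill's actual implementation; the exact strings Hawk and Inspect emit may differ.

```python
import re
import sys
from collections import Counter

# Hypothetical signatures for the failure modes discussed above.
# Adjust the regexes to match the log format of your deployment.
FAILURE_PATTERNS = {
    "oom_killed": re.compile(r"OOMKilled"),
    "rate_limited": re.compile(r"rate ?limit|\b429\b", re.IGNORECASE),
    "server_error": re.compile(r"internal server error|\b500\b", re.IGNORECASE),
    "timeout": re.compile(r"timed? ?out|deadline exceeded", re.IGNORECASE),
}

def classify_log(lines):
    """Count occurrences of each known failure signature in a log stream."""
    counts = Counter()
    for line in lines:
        for label, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for label, n in classify_log(f).most_common():
            print(f"{label}: {n}")
```

A dominant `rate_limited` count points toward backing off concurrency, while any `oom_killed` hit shifts the investigation to pod resources rather than the model API.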
Key Features
1. Sample buffer management to enable evaluation resumption without data loss
2. Detection of common error patterns including 500 errors, rate limits, and OOMKilled pods
3. Structured recovery workflows for deleting and restarting hanging evaluations
4. Automated Hawk status and log analysis for specific evaluation sets
5. API connectivity testing through the Middleman proxy and direct provider endpoints (see the sketch after this list)
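As a rough illustration of the proxy-versus-direct connectivity check, the sketch below sends the same minimal chat completion through a Middleman-style base URL and straight to the provider. The URLs, environment variable names, and model name are assumptions for illustration, not the skill's actual configuration.

```python
import os
import requests

# Both base URLs are placeholders; substitute your deployment's
# Middleman address and the real provider endpoint.
ENDPOINTS = {
    "middleman_proxy": os.environ.get(
        "MIDDLEMAN_BASE_URL", "https://middleman.example.internal/v1"
    ),
    "direct_provider": "https://api.openai.com/v1",
}

PAYLOAD = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 1,
}

def probe(name: str, base_url: str, api_key: str) -> None:
    """Send one tiny chat completion and report the status, to separate
    proxy-side failures (auth, 5xx) from provider-side ones."""
    try:
        r = requests.post(
            f"{base_url}/chat/completions",
            json=PAYLOAD,
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        print(f"{name}: HTTP {r.status_code} -- {r.text[:120]}")
    except requests.RequestException as exc:
        print(f"{name}: request failed -- {exc}")

if __name__ == "__main__":
    for name, base_url in ENDPOINTS.items():
        probe(name, base_url, os.environ.get("API_KEY", ""))
```

Running both probes back to back is what makes the comparison useful: a failure that appears only on the proxy path implicates Middleman, while matching failures on both paths implicate the provider.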
Use Cases
1. Investigating evaluations that are frozen or showing no progress in sample completion
2. Debugging high retry counts and 'Internal server error' messages in model API calls
3. Verifying whether evaluation failures are caused by proxy auth issues or provider downtime (a triage sketch follows this list)
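As a companion to the connectivity probe above, here is one way to fold the two status codes into a verdict for the third use case. This decision table is an assumption based on common HTTP semantics (401/403 for auth, 429 for rate limits, 5xx for provider trouble), not the skill's documented logic.

```python
from typing import Optional

def triage(proxy_status: Optional[int], direct_status: Optional[int]) -> str:
    """Heuristic verdict from the two probe results.
    None means the request never completed (network failure or timeout)."""
    if proxy_status in (401, 403) and direct_status == 200:
        return "proxy auth issue: Middleman rejected credentials the provider accepts"
    if proxy_status == 200 and direct_status == 200:
        return "connectivity fine: look elsewhere (e.g. OOMKilled pods, stuck samples)"
    if proxy_status == 429 or direct_status == 429:
        return "rate limited: back off and lower eval concurrency"
    if (proxy_status or 0) >= 500 and (direct_status or 0) >= 500:
        return "provider downtime: both paths return 5xx"
    if proxy_status is None and direct_status == 200:
        return "proxy unreachable while provider is up: check the Middleman deployment"
    return "ambiguous: capture full logs before deleting or restarting the eval"

# Example: the proxy returns 403 while the direct call succeeds.
print(triage(403, 200))
```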