Troubleshoots and resolves stalled or hanging AI evaluations within the METR Hawk and Inspect AI ecosystem.
The debug-stuck-eval skill provides a specialized diagnostic toolkit for developers and researchers running AI evaluations on UK AISI's Inspect platform. It automates the process of identifying why an evaluation is frozen or failing, offering guided steps to verify authentication, analyze cloud logs for error patterns such as OOM kills or API timeouts, and test connectivity through proxy services like Middleman. By interpreting retry loops and pod statuses, it helps users recover stuck evaluation sets and resume from S3 sample buffers so that completed samples are not lost.
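
A quick first check before deeper triage is whether the eval is genuinely stuck or just slow. Below is a minimal sketch of that check, assuming local access to the Inspect log file; the path and five-minute interval are placeholders, not values prescribed by the skill:

```python
# Crude liveness check: if the Inspect log file is still growing, samples are
# still being written. A sketch only; the log path below is hypothetical.
import time
from pathlib import Path

LOG_FILE = Path("logs/2024-06-01T12-00-00_my-task_abc123.eval")  # placeholder

size_before = LOG_FILE.stat().st_size
time.sleep(300)  # re-check after five minutes
size_after = LOG_FILE.stat().st_size

if size_after > size_before:
    print("Log is still growing; the eval is likely progressing, not stuck.")
else:
    print("Log has not grown in five minutes; check pod status and API logs.")
```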
Key Features
1. Real-time status tracking for eval sets and individual sample pod completion
2. Recognition of specific Inspect AI error patterns and OpenAI SDK retry behaviors
3. Direct API connectivity testing via Middleman and provider-specific endpoints
4. Automated log analysis for identifying OOMKilled pods, 500 errors, and retry loops (see the pod-triage sketch after this list)
5. Guided recovery workflows, including eval restarts with sample buffer resumption
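
As an illustration of the pod triage automated by feature 4, the sketch below lists pods and flags any container that Kubernetes terminated with OOMKilled. It assumes the official `kubernetes` Python client, a working kubeconfig, and a hypothetical `inspect-evals` namespace:

```python
# Flag OOMKilled containers in the eval namespace. A sketch, not the skill's
# actual implementation; namespace and access method are assumptions.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "inspect-evals"  # placeholder; substitute your eval set's namespace

for pod in v1.list_namespaced_pod(NAMESPACE).items:
    for cs in pod.status.container_statuses or []:
        # A container killed for exceeding its memory limit reports
        # reason "OOMKilled" in its current or last terminated state.
        terminated = cs.state.terminated or cs.last_state.terminated
        if terminated and terminated.reason == "OOMKilled":
            print(f"{pod.metadata.name}/{cs.name}: OOMKilled "
                  f"(restarts={cs.restart_count})")
```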
Use Cases
1. Verifying whether a 'stuck' evaluation is actually progressing via alternating fail-ok patterns (see the log-scanning sketch after this list)
2. Troubleshooting 500 Internal Server errors and rate limits during large-scale model evals
3. Diagnosing why an AI evaluation set has stopped progressing or appears frozen
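
For use cases 1 and 2, a rough way to distinguish a healthy retry loop from a genuine stall is to count error and retry signatures in a captured pod log. The sketch below assumes the log has been saved locally; the file name and regex patterns are illustrative, not the exact strings every provider or SDK version emits:

```python
# Scan a saved pod log for the signatures behind an "alternating fail-ok"
# pattern: repeated 500s or rate limits followed by SDK retry notices.
import re
from collections import Counter
from pathlib import Path

log_text = Path("pod.log").read_text()  # e.g. captured via `kubectl logs <pod>`

patterns = {
    "http_500": re.compile(r"500 Internal Server Error", re.IGNORECASE),
    "rate_limit": re.compile(r"429|rate limit", re.IGNORECASE),
    "sdk_retry": re.compile(r"retrying request", re.IGNORECASE),
}

counts = Counter({name: len(p.findall(log_text)) for name, p in patterns.items()})
print(counts)

# Many retries with no terminal error usually means the eval is slow rather
# than stuck: the SDK is backing off and recovering, so samples still complete.
```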