Diagnoses and resolves stalled or failing AI model evaluations within the METR Hawk and UK AISI Inspect frameworks.
The debug-stuck-eval skill provides a specialized diagnostic framework for troubleshooting AI model evaluations that have hung, timed out, or encountered persistent errors. It enables Claude to guide users through verifying authentication, checking pod status, and interpreting complex log patterns like retry loops or OOMKilled events. By combining low-level log analysis with direct API connectivity testing through middleman proxies, this skill helps identify whether bottlenecks are caused by token limits, infrastructure failures, or provider instability, ensuring that researchers can efficiently resume and complete critical model safety evaluations.
Key Features
1. Automated identification of common Inspect AI error patterns and retry logs
2. Advanced log parsing using the inspect_ai Python library (see the log-parsing sketch after this list)
3. Step-by-step Hawk infrastructure verification and pod status reporting (see the pod-status sketch below)
4. Evaluation recovery workflows using S3 buffer resumes
5. Direct API connectivity testing via middleman auth proxy scripts
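To see where a run stalled, the eval log itself is usually the fastest signal. Below is a minimal sketch using inspect_ai's log API (read_eval_log); the log path is illustrative, and the status/error/samples fields follow inspect_ai's EvalLog model.

```python
from inspect_ai.log import read_eval_log

LOG = "logs/2024-06-01T12-00-00_my-task.eval"  # illustrative path

# header_only=True skips per-sample bodies, which keeps this fast on large logs.
log = read_eval_log(LOG, header_only=True)
print(f"status: {log.status}")  # "started" | "success" | "cancelled" | "error"
if log.status == "error" and log.error:
    print(f"error: {log.error.message}")

# For a hang, re-read with samples to see where progress stopped.
full = read_eval_log(LOG)
print(f"samples recorded: {len(full.samples or [])}")
```

A log still in "started" status long after the run should have finished is the classic signature of a hung evaluation.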
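OOMKilled terminations and crash-looping pods show up in Kubernetes pod status. The sketch below shells out to kubectl; the hawk-evals namespace is a placeholder for whatever namespace your Hawk deployment uses, while the JSON paths are standard Kubernetes pod-status fields.

```python
import json
import subprocess

# "hawk-evals" is a placeholder namespace; substitute your deployment's.
result = subprocess.run(
    ["kubectl", "get", "pods", "-n", "hawk-evals", "-o", "json"],
    capture_output=True, text=True, check=True,
)

for pod in json.loads(result.stdout)["items"]:
    name = pod["metadata"]["name"]
    print(f"{name}: {pod['status'].get('phase', 'Unknown')}")
    # An OOMKilled container is recorded in its last terminated state.
    for cs in pod["status"].get("containerStatuses", []):
        terminated = (cs.get("lastState") or {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            print(f"  container {cs['name']} was OOMKilled")
```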
Use Cases
1. Debugging 500 Internal Server errors and 400 token-limit issues in model requests (a connectivity probe sketch follows this list)
2. Recovering and restarting stalled evaluation sets without losing progress (see the retry sketch below)
3. Troubleshooting evaluations that are frozen or hanging at a specific sample count
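When model requests fail with 500s, probing the provider directly through the middleman proxy separates provider instability from harness bugs. Everything in this sketch is an assumption about the deployment: the endpoint URL, the MIDDLEMAN_TOKEN environment variable, and the OpenAI-style payload are placeholders to adapt to your proxy's actual API.

```python
import os
import requests

# Placeholder endpoint and credentials; adjust to your middleman deployment.
URL = os.environ.get(
    "MIDDLEMAN_URL", "https://middleman.example.internal/v1/chat/completions"
)
TOKEN = os.environ["MIDDLEMAN_TOKEN"]

resp = requests.post(
    URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "model": "gpt-4o",  # placeholder model name
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 5,
    },
    timeout=30,
)
# 200 -> proxy and provider reachable; 401/403 -> auth; 5xx -> provider/proxy.
print(resp.status_code, resp.text[:300])
```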
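For resuming a stalled set without redoing finished work, Inspect provides a retry entry point that reuses completed samples from the existing log (CLI equivalent: `inspect eval-retry <log-file>`). A minimal sketch; whether Hawk's S3 sample buffer is picked up automatically is deployment-specific and not shown here.

```python
from inspect_ai import eval_retry

# Retry the failed/cancelled run recorded in this log; completed samples
# are reused rather than re-executed. The path is illustrative.
logs = eval_retry("logs/2024-06-01T12-00-00_my-task.eval")
print(logs[0].status)
```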