How do I fix a 'Pod UID mismatch' error?

A Pod UID mismatch typically means a sandbox pod was killed and restarted. Inspect AI's retry logic usually handles this automatically, so a manual fix is rarely required.

Why does the OpenAI SDK hide the actual error message during retries?

The SDK's internal retry mechanism often masks the underlying error. This skill provides curl commands to query the API directly, revealing the true 400 or 500 error details.
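The same idea as the curl commands can be sketched in Python: send the request exactly once, with no retry layer, and read the response body even when the status is 4xx or 5xx. This is a minimal sketch, not the skill's own tooling. The function names, the JSON payload shape, and the Bearer-token header are illustrative assumptions; substitute whatever endpoint and authentication your deployment actually uses.

```python
import json
import urllib.error
import urllib.request


def build_request(url: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build a single JSON POST request (hypothetical helper; no retries)."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the endpoint accepts a Bearer token.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_once(url: str, payload: dict, api_key: str, timeout: float = 30.0):
    """Send the request exactly once and return (status_code, body_text)."""
    request = build_request(url, payload, api_key)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the body usually carries
        # the error detail that a retrying client would not surface.
        return err.code, err.read().decode("utf-8", "replace")
```

Calling `send_once(...)` against the endpoint an evaluation is using returns the raw status and body, which can then be compared with what the retrying client reported.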

What should I do if an evaluation shows high HTTP retry counts?

High retry counts indicate API instability. You should test the API directly using the provided curl commands to determine if the issue lies with the Middleman proxy or the model provider.
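The comparison described above can be written down as a small decision rule: issue the same request once through Middleman and once directly to the provider, then compare the two status codes. The sketch below is an illustration of that reasoning only. The function name and the returned labels are hypothetical, and a single pair of requests cannot rule out intermittent failures.

```python
def locate_fault(proxy_status: int, provider_status: int) -> str:
    """Guess where a failure originates from two HTTP status codes.

    proxy_status:    status of the request sent through the proxy
    provider_status: status of the same request sent directly to the provider
    """
    proxy_ok = 200 <= proxy_status < 300
    provider_ok = 200 <= provider_status < 300

    if proxy_ok and provider_ok:
        # Both paths work right now, so the retries may be intermittent.
        return "intermittent"
    if provider_ok:
        # The provider answers directly, but the proxied path fails.
        return "proxy"
    if proxy_ok:
        # Only the direct call fails; check the direct credentials or URL.
        return "direct_call"
    # Both paths fail: the provider or the request itself is the likely cause.
    return "provider_or_request"
```

For example, a 502 through the proxy combined with a 200 from the provider points at the proxy, while a 400 on both paths suggests the request itself is malformed.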

Does restarting a Hawk evaluation lose my progress?

No, because Hawk uses an S3-backed buffer. Inspect AI can resume from the last successful sample unless you explicitly use the --no-resume flag.

What is the Middleman proxy in this context?

Middleman is the internal authentication proxy for METR. This skill helps determine if an evaluation failure is caused by this proxy or an external API provider.

Inspect AI Eval Debugger

Author: METR

Analytics & Monitoring

Diagnoses and resolves stalled or failing Inspect AI evaluations by analyzing logs, pod states, and API connectivity within the Hawk cloud environment.

The debug-stuck-eval skill provides a specialized diagnostic toolkit for troubleshooting UK AISI Inspect evaluations. It enables Claude to identify why an evaluation set has frozen or is returning errors by verifying authentication, checking pod statuses via Hawk, and scanning logs for specific failure patterns such as OOMKilled events or 500 Internal Server Errors. By testing API connectivity through the Middleman proxy and resuming from S3-backed buffers, the skill streamlines recovery from infrastructure hiccups and API instability, so that evaluations can complete with minimal manual intervention.
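As an illustration of the log-scanning step, the sketch below counts how many log lines match each of the failure patterns named above. It is not the skill's implementation: the pattern names, the regular expressions, and the assumption that logs arrive as plain text lines are all illustrative, and real log formats may need different expressions.

```python
import re
from collections import Counter

# Hypothetical patterns based on the failure strings named in the text.
FAILURE_PATTERNS = {
    "oom_killed": re.compile(r"OOMKilled"),
    "pod_uid_mismatch": re.compile(r"Pod UID mismatch", re.IGNORECASE),
    "http_500": re.compile(r"\b500\b.*Internal Server Error", re.IGNORECASE),
}


def count_failures(lines):
    """Return a Counter mapping pattern name to number of matching lines."""
    counts = Counter()
    for line in lines:
        for name, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Passing the lines of a saved log file to `count_failures` gives a quick summary of which failure mode dominates before looking at individual samples.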

Key Features

01 Authentication verification and troubleshooting for Hawk CLI access

02 Smart recovery workflows for restarting evaluations using S3 buffer resumes

03 24 GitHub stars

04 Connectivity diagnostic tools for Middleman and direct model provider APIs

05 Real-time status tracking for individual samples within an evaluation set

06 Automated log analysis for common error patterns like Pod UID mismatch and OOMKilled

Use Cases

01 Troubleshooting evaluations that are hanging or showing no progress in the status dashboard

02 Recovering an evaluation that has crashed due to memory limits or context window overflows

03 Identifying whether an API error is caused by the Middleman proxy or the upstream model provider