Diagnoses and resolves hanging or failing Inspect AI evaluations within Hawk cloud environments.
The debug-stuck-eval skill provides a specialized diagnostic framework for developers running UK AISI's Inspect AI framework in the cloud. It streamlines troubleshooting for evaluations that have stalled, timed out, or returned persistent 500 errors by providing structured workflows for checking pod health, analyzing retry patterns in logs, and testing API connectivity through Middleman proxies. Whether the problem is an OOM error, a token limit, or a malformed API response, the skill supplies the specific Hawk CLI commands and curl tests needed to identify the root cause and safely resume evaluation runs from S3 buffers.
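The triage workflow described above can be sketched as a small log classifier. This is a minimal illustration, not part of the skill itself: the pattern strings are assumptions and should be matched against your actual Inspect AI, pod, and provider log lines before relying on them.

```shell
#!/bin/sh
# Rough triage of a stalled eval from its log tail.
# Pattern strings below are illustrative assumptions, not exact
# Inspect AI or provider output.
classify_log() {
  log="$(cat)"
  case "$log" in
    *OOMKilled*)                   echo "memory-exhaustion" ;;
    *"maximum context length"*)    echo "token-limit" ;;
    *"500 Internal Server Error"*) echo "provider-5xx" ;;
    *"Retrying request"*)          echo "transient-retries" ;;
    *)                             echo "unknown" ;;
  esac
}

# Example: pipe the last lines of an eval log through the classifier.
printf 'openai retry: 500 Internal Server Error\n' | classify_log  # -> provider-5xx
```

In practice you would feed in the tail of the real log (for example via `tail -n 200 eval.log | classify_log`) rather than an inline string.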
Key Features
1. Step-by-step diagnostic checklist for Hawk cloud authentication and pod status
2. Connectivity testing scripts for Middleman proxies and direct model providers
3. Safe recovery procedures to restart stuck evaluations using S3 buffer resumes
4. Resource monitoring guidance to identify OOMKilled pods and memory exhaustion
5. Log pattern identification for OpenAI SDK retries and Inspect-specific errors
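As one example of the resource-monitoring step, the sketch below filters OOMKilled pods out of a `kubectl get pods`-style listing. The inline sample stands in for real output; with cluster access you would pipe the actual listing (from kubectl or the equivalent Hawk CLI command, which may differ) into the function.

```shell
#!/bin/sh
# Sketch: pick out pods whose STATUS column reads OOMKilled from a
# `kubectl get pods`-style listing (NAME READY STATUS RESTARTS AGE).
find_oom_pods() {
  awk '$3 == "OOMKilled" { print $1 }'
}

# Sample listing standing in for real `kubectl get pods` output:
cat <<'EOF' | find_oom_pods
eval-runner-0   1/1   Running     0   3h
eval-runner-1   0/1   OOMKilled   4   3h
eval-runner-2   1/1   Running     0   3h
EOF
# -> eval-runner-1
```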
Use Cases
1. Troubleshooting AI evaluations that are hanging or frozen at specific sample counts
2. Identifying whether a bottleneck is caused by the model provider, the proxy, or the cloud infrastructure
3. Investigating 500 Internal Server Error responses and API instability during large-scale runs
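To separate provider, proxy, and infrastructure failures, a connectivity probe along these lines can help. `MIDDLEMAN_URL`, the `/v1/models` path, and the auth header are assumptions to adapt to your deployment; the status interpretations are a rough heuristic, not an exhaustive diagnosis.

```shell
#!/bin/sh
# Sketch: interpret the HTTP status returned by a probe of an
# OpenAI-compatible Middleman proxy.
interpret_status() {
  case "$1" in
    200)     echo "proxy and provider reachable" ;;
    401|403) echo "auth problem: check API key / token" ;;
    5??)     echo "server-side failure: check proxy and provider health" ;;
    000)     echo "no connection: check network path to the proxy" ;;
    *)       echo "unexpected status $1" ;;
  esac
}

# Probe (curl prints only the status code; curl reports 000 when the
# connection itself failed). Endpoint and header are assumptions:
# code=$(curl -s -o /dev/null -w '%{http_code}' \
#   -H "Authorization: Bearer $API_KEY" "$MIDDLEMAN_URL/v1/models")
# interpret_status "$code"
interpret_status 503  # -> server-side failure: check proxy and provider health
```

A 000 result points at the network path or the proxy being down, while a 5xx with a reachable proxy suggests looking at the model provider next.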