Diagnoses and resolves stalled or failing AI model evaluations running on the UK AISI Inspect framework via METR's Hawk CLI.
This skill provides specialized commands and troubleshooting patterns for managing UK AISI Inspect evaluations via the Hawk CLI. It enables developers to monitor evaluation progress, inspect pod logs, identify API bottlenecks through Middleman proxy testing, and recover stalled samples using S3 buffers. It is particularly useful when evaluations hang, hit persistent 500 errors, or run into rate limits, offering clear resolution paths for complex distributed evaluation environments.
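When an evaluation stalls on provider errors, the fastest check is often a single request sent straight through the proxy. Below is a minimal connectivity probe in Python; the proxy URL, API-key environment variable, and payload shape are assumptions for illustration, not the real Middleman interface:

```python
"""Minimal probe for the model-API path through an auth proxy.

Sketch only: MIDDLEMAN_URL and MIDDLEMAN_API_KEY are hypothetical
names, and the payload mimics a generic chat-completions schema.
"""
import os

import requests

PROXY_URL = os.environ.get(
    "MIDDLEMAN_URL",  # hypothetical env var
    "https://middleman.internal.example/v1/chat/completions",
)
API_KEY = os.environ["MIDDLEMAN_API_KEY"]  # hypothetical env var

resp = requests.post(
    PROXY_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "test-model",  # placeholder model name
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    },
    timeout=30,
)

# 401/403 -> auth problem at the proxy; 429 -> rate limit;
# 5xx -> proxy or provider outage; 200 -> this leg is healthy.
print(resp.status_code, resp.text[:200])
```

Running the same probe against the provider's public endpoint helps separate proxy faults from provider faults.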
Key Features
1. Automated diagnostic checklists for OOMKilled pods and context limit errors (see the pod-status sketch after this list)
2. Detailed log analysis to identify specific API error patterns and retry loops (see the log-scan sketch after this list)
3. Direct API testing for Middleman auth proxies and model providers
4. Real-time status monitoring and pod metric tracking via the Hawk CLI
5. Buffer management using S3 to resume interrupted evaluation sets
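For the OOMKilled checklist in item 1, here is a hedged sketch of the core check: reading a pod's last termination state from kubectl's JSON output. The pod name and namespace are placeholders, and a kubectl context pointed at the evaluation cluster is assumed:

```python
"""Check whether an evaluation pod's container was OOMKilled.

Sketch only: POD and NAMESPACE are placeholders; requires kubectl
configured against the evaluation cluster.
"""
import json
import subprocess

POD = "inspect-eval-runner-0"  # hypothetical pod name
NAMESPACE = "inspect-evals"    # hypothetical namespace

raw = subprocess.run(
    ["kubectl", "get", "pod", POD, "-n", NAMESPACE, "-o", "json"],
    check=True, capture_output=True, text=True,
).stdout
pod = json.loads(raw)

for cs in pod["status"].get("containerStatuses", []):
    terminated = cs.get("lastState", {}).get("terminated") or {}
    if terminated.get("reason") == "OOMKilled":
        print(
            f"{cs['name']}: OOMKilled at {terminated.get('finishedAt')} "
            f"(exit code {terminated.get('exitCode')}) -- raise the memory limit."
        )
    else:
        print(f"{cs['name']}: no OOMKill in last termination state.")
```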
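For the log analysis in item 2, a minimal sketch that tallies common error signatures in a captured log file (for example, one saved with `kubectl logs`); the path and regex patterns are assumptions about typical provider errors, not Inspect's exact log format:

```python
"""Tally API error patterns and retry loops in an evaluation log.

Sketch only: LOG_PATH and the regexes are assumptions about typical
provider-error messages, not Inspect's exact output.
"""
import re
from collections import Counter

LOG_PATH = "eval.log"  # e.g. saved via `kubectl logs <pod> > eval.log`

PATTERNS = {
    "rate_limit": re.compile(r"\b429\b|rate.?limit", re.I),
    "server_error": re.compile(r"\b(500|502|503|504)\b|internal server error", re.I),
    "retry": re.compile(r"\bretry(ing)?\b", re.I),
    "context_limit": re.compile(r"context.?(length|window|limit)", re.I),
}

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1

# Many retries alongside 429s points at rate limiting; retries with
# 5xx responses point at a flaky provider; context_limit hits suggest
# oversized prompts rather than infrastructure trouble.
for name, n in counts.most_common():
    print(f"{name}: {n}")
```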
Use Cases
1. Verifying authentication and connectivity between the evaluation runner and model providers
2. Recovering a failed evaluation from a checkpoint without losing previous sample progress (see the buffer sketch after this list)
3. Troubleshooting an AI evaluation set that has stopped progressing or is throwing persistent 500 errors
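For the checkpoint recovery in use case 2, a sketch of how completed samples might be enumerated from an S3 buffer before resuming, so only unfinished work is re-run. The bucket name, key prefix, and one-object-per-sample layout are hypothetical, not the skill's real schema:

```python
"""List buffered (completed) samples in S3 and compute the remainder.

Sketch only: bucket, prefix, key layout, and sample IDs are
hypothetical placeholders for whatever the evaluation set defines.
"""
import boto3

BUCKET = "inspect-eval-buffers"                        # hypothetical bucket
PREFIX = "runs/run-1234/samples/"                      # hypothetical layout
ALL_SAMPLES = {f"sample-{i:04d}" for i in range(500)}  # hypothetical IDs

s3 = boto3.client("s3")
done = set()
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # e.g. runs/run-1234/samples/sample-0042.json -> sample-0042
        key = obj["Key"].rsplit("/", 1)[-1]
        done.add(key.removesuffix(".json"))

remaining = sorted(ALL_SAMPLES - done)
print(f"{len(done)} samples buffered; {len(remaining)} left to resume.")
```

Resuming with only the remaining IDs preserves prior sample progress instead of re-running the whole set.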