Diagnoses and resolves stalled or failing Inspect AI evaluations running on the METR Hawk platform.
The Hawk Evaluation Debugger is a specialized skill for troubleshooting UK AISI's Inspect AI evaluations when they hang, time out, or return persistent errors in the cloud. It provides a structured workflow for verifying authentication, monitoring pod health, and analyzing logs for specific failure signatures such as API retries, OOM errors, and proxy issues. By enabling direct API testing through the middleman proxy and providing recovery commands for S3 sample buffers, the skill helps researchers minimize downtime and ensure that large-scale LLM evaluation sets run to completion.
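For the "direct API testing through the middleman proxy" step, a quick connectivity probe usually separates authentication problems from proxy or upstream provider faults. The sketch below assumes an OpenAI-compatible proxy endpoint; the base URL, environment-variable names, and model name are placeholders, not the actual Hawk or middleman configuration.

```python
import os
import requests

# Placeholders -- substitute the values used by your deployment.
PROXY_BASE_URL = os.environ.get("MIDDLEMAN_BASE_URL", "https://middleman.example.internal/v1")
API_KEY = os.environ["MIDDLEMAN_API_KEY"]  # hypothetical variable name

# One tiny chat completion is enough to exercise the full request path.
resp = requests.post(
    f"{PROXY_BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "gpt-4o-mini",  # any model the proxy routes
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 5,
    },
    timeout=30,
)

# 200 = healthy end to end; 401/403 = auth; 429 = rate limiting; 5xx = proxy or provider fault.
print(resp.status_code)
print(resp.text[:500])
```

The same request can be issued with curl if Python is not available; the status code carries the same diagnostic meaning either way.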
Key Features
1. Sample buffer management to resume evaluations without data loss
2. Real-time status monitoring for Hawk evaluation sets and pod states
3. Streamlined log retrieval and follow mode for active debugging
4. Direct middleman proxy and provider API connectivity testing via curl
5. Automated error pattern recognition for API retries, 500 errors, and OOM issues (see the log-scan sketch after this list)
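The error-pattern recognition feature amounts to scanning retrieved pod logs for a handful of failure signatures. A minimal sketch of that kind of scan follows; the regular expressions are illustrative guesses to adjust against the exact wording in your evaluation logs, and piping logs in via stdin is an assumption about how they are retrieved.

```python
import re
import sys
from collections import Counter

# Illustrative failure signatures -- adjust to match your logs' exact wording.
PATTERNS = {
    "api_retry": re.compile(r"retry(ing)? request|rate limit|\b429\b", re.IGNORECASE),
    "server_error": re.compile(r"500 Internal Server Error|\b50[234]\b", re.IGNORECASE),
    "oom": re.compile(r"OOMKilled|out of memory|MemoryError", re.IGNORECASE),
    "proxy": re.compile(r"proxy error|connection reset|timed out", re.IGNORECASE),
}

counts = Counter()
for line in sys.stdin:
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            counts[name] += 1

# Report the most frequent signatures first.
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

Piped usage might look like `kubectl logs <pod-name> | python scan_logs.py`, with the pod name taken from the status-monitoring step.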
Use Cases
1. Investigating 500 Internal Server Errors and 429 Rate Limit responses from the middleman proxy
2. Diagnosing why an evaluation set is frozen or not progressing past a specific sample count
3. Recovering and restarting failed evaluations using S3 buffer persistence (see the sketch after this list)
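Before restarting a failed evaluation from its persisted sample buffer, it helps to confirm that the buffered objects actually exist in S3. Below is a minimal sketch assuming boto3 credentials are already configured; the bucket name and key prefix are hypothetical placeholders for wherever your deployment writes its buffers, not a documented Hawk convention.

```python
import boto3

# Placeholder bucket and prefix -- use the buffer location configured for your evaluation set.
BUCKET = "hawk-eval-buffers"
PREFIX = "eval-sets/my-eval-set-id/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total = 0
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        total += 1
        print(f'{obj["LastModified"]}  {obj["Size"]:>10}  {obj["Key"]}')

print(f"{total} buffered objects found under s3://{BUCKET}/{PREFIX}")
```

A non-zero object count with recent timestamps suggests the buffer survived the failure and the evaluation set can be resumed rather than rerun from scratch.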