Diagnoses and resolves hanging or failing UK AISI Inspect AI evaluations running in the Hawk cloud environment.
The Inspect AI Eval Debugger skill provides a specialized diagnostic framework for troubleshooting stalled or failing model evaluations within the UK AISI Inspect ecosystem. It enables Claude to perform deep-dive analysis into Hawk evaluation sets by verifying authentication, monitoring pod health, and inspecting real-time logs for specific error signatures like OOMKilled events, 500 internal server errors, and API retry loops. By combining log analysis with direct API connectivity testing through middleman proxies, this skill helps developers identify whether issues stem from the provider, the proxy, or the evaluation configuration, while providing clear recovery paths to resume progress without data loss.
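As an illustration of the kind of signature matching this involves, the following Python sketch scans a log dump for the error patterns named above. The regexes and the local log path are assumptions for illustration, not the skill's own code.

```python
import re

# Illustrative error signatures to look for in evaluation logs.
# The exact patterns are assumptions, not the skill's internals.
ERROR_SIGNATURES = {
    "oom_killed": re.compile(r"OOMKilled"),
    "server_error": re.compile(r"500 Internal Server Error|HTTP/\d(?:\.\d)? 500"),
    "retry_loop": re.compile(r"retrying in \d+(?:\.\d+)? seconds", re.IGNORECASE),
}

def classify_log(text: str) -> dict[str, int]:
    """Count occurrences of each known error signature in a log dump."""
    return {name: len(pattern.findall(text)) for name, pattern in ERROR_SIGNATURES.items()}

if __name__ == "__main__":
    with open("eval.log") as f:  # hypothetical local log dump
        counts = classify_log(f.read())
    for name, n in counts.items():
        if n:
            print(f"{name}: {n} occurrence(s)")
```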
Key Features
1. Streamlined log retrieval, including support for follow mode and S3 buffer access (see the first sketch after this list)
2. Real-time evaluation status monitoring and pod state analysis
3. Automated identification of common error patterns such as OOMKilled events and rate limits (as in the signature scan above)
4. Integrated API testing via curl to isolate middleman and provider issues (see the connectivity probe below)
5. Guided recovery workflows to delete and restart stuck evaluation sets with resume support (see the recovery sketch below)
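A minimal sketch of the log-retrieval feature, assuming a kubectl-managed runner pod and an S3 bucket that buffers evaluation logs. The pod name, namespace, bucket, and object key are hypothetical placeholders.

```python
import subprocess
import boto3

NAMESPACE = "inspect-evals"             # assumed namespace
POD = "hawk-eval-runner-0"              # hypothetical pod name
BUCKET = "hawk-eval-logs"               # hypothetical S3 buffer bucket
KEY = "eval-sets/12345/buffer.log"      # hypothetical object key

def follow_pod_logs() -> None:
    """Stream live pod logs in follow mode, mirroring `kubectl logs -f`."""
    subprocess.run(["kubectl", "logs", "-f", POD, "-n", NAMESPACE], check=True)

def fetch_s3_buffer() -> str:
    """Pull the buffered log object from S3 when the pod is already gone."""
    obj = boto3.client("s3").get_object(Bucket=BUCKET, Key=KEY)
    return obj["Body"].read().decode("utf-8")
```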
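For the API-isolation step, the skill issues curl requests; the sketch below is a Python rendering of the same probe against an assumed OpenAI-compatible middleman endpoint. The URL, environment variable, and model name are placeholders to substitute with real values.

```python
import os
import requests

# Hypothetical middleman proxy endpoint and credentials.
PROXY_URL = "https://middleman.example.internal/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['MIDDLEMAN_API_KEY']}"}
PAYLOAD = {
    "model": "gpt-4o",  # placeholder model name
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 1,
}

# A 2xx here while the eval still fails points at the evaluation
# configuration; a 5xx here points at the proxy or the upstream provider.
resp = requests.post(PROXY_URL, headers=HEADERS, json=PAYLOAD, timeout=30)
print(resp.status_code, resp.text[:200])
```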
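A recovery sketch under two assumptions: the stuck runner is an ordinary Kubernetes pod that can be deleted with kubectl, and the evaluation can be resumed from its existing Inspect log via Inspect's standard `inspect eval-retry` entry point, which reuses completed samples rather than starting over. The pod name, namespace, and log path are hypothetical.

```python
import subprocess

EVAL_SET_POD = "hawk-eval-runner-0"  # hypothetical stuck pod
NAMESPACE = "inspect-evals"          # assumed namespace
LOG_FILE = "logs/2024-06-01T12-00-00_task_abc123.eval"  # hypothetical log

# Step 1: remove the stuck pod so its controller can reschedule it.
subprocess.run(["kubectl", "delete", "pod", EVAL_SET_POD, "-n", NAMESPACE], check=True)

# Step 2: retry the evaluation from its existing log so already-completed
# samples are preserved instead of being re-run.
subprocess.run(["inspect", "eval-retry", LOG_FILE], check=True)
```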
Use Cases
1. Analyzing memory exhaustion and pod restarts in cloud-based Inspect runners (see the pod-inspection sketch after this list)
2. Troubleshooting evaluations that are frozen or stuck in a retry loop
3. Debugging 500 errors and API timeouts in large-scale model testing runs
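To make the first use case concrete, here is a sketch using the official Kubernetes Python client to find containers whose last termination reason was OOMKilled; the namespace is an assumed placeholder.

```python
from kubernetes import client, config

NAMESPACE = "inspect-evals"  # assumed namespace for the eval runner pods

config.load_kube_config()  # local kubeconfig; in-cluster config also works
v1 = client.CoreV1Api()

# Flag containers that were last terminated by the OOM killer, along with
# their restart counts, to surface memory-exhaustion patterns.
for pod in v1.list_namespaced_pod(NAMESPACE).items:
    for status in pod.status.container_statuses or []:
        terminated = status.last_state.terminated
        if terminated and terminated.reason == "OOMKilled":
            print(f"{pod.metadata.name}: container {status.name} was OOMKilled "
                  f"(restarts: {status.restart_count})")
```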