Diagnoses and resolves hanging, stalled, or failing AI evaluations within the Hawk and Inspect AI frameworks.
This skill provides specialized diagnostics for troubleshooting UK AISI's Inspect AI evaluations that have stalled or encountered errors. It streamlines debugging by guiding users through authentication checks, pod status monitoring, log analysis for specific error patterns (such as 500/400 status codes or OOMKilled states), and direct API testing. By identifying whether a bottleneck stems from rate limits, context window exhaustion, or infrastructure failures, it helps evaluations resume efficiently from S3-backed buffers using the proper recovery commands.
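The log-analysis step described above amounts to classifying error lines into the failure categories the skill distinguishes. A minimal sketch of that triage, where the regex patterns and category names are illustrative assumptions rather than the skill's actual rules:

```python
import re

# Hypothetical triage table: maps a raw log line to a failure class.
# Patterns are illustrative assumptions, not the skill's real matchers.
PATTERNS = [
    ("rate_limit", re.compile(r"\b429\b|rate limit", re.I)),
    ("client_error", re.compile(r"\b4\d\d\b")),
    ("server_error", re.compile(r"\b5\d\d\b|internal error", re.I)),
    ("oom", re.compile(r"OOMKilled")),
    ("context_window", re.compile(r"context (window|length)|maximum context", re.I)),
]

def classify(line: str) -> str:
    """Return the first matching failure category, or 'unknown'."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return "unknown"
```

Checking rate limits before generic 4xx codes matters, since a 429 is recoverable by backing off while other client errors usually indicate a misconfigured request.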
Key Features
1. Automated identification of error patterns such as 429 rate limits and 500 internal errors
2. Direct API connectivity testing via Middleman and provider proxies
3. Recovery workflows for resuming evaluations from S3 buffers
4. Real-time log streaming and sample completion tracking via the Hawk CLI
5. Comprehensive evaluation status and pod health monitoring
Use Cases
1. Troubleshooting why an evaluation set has stopped progressing at a specific sample
2. Investigating frequent 500 errors or retry loops in model response calls
3. Determining whether infrastructure issues (e.g. OOMKilled pods) or API limits are causing eval timeouts
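Spotting the retry loops mentioned in the use cases above can be done by counting retries per sample in the streamed logs. A minimal sketch, assuming a hypothetical log format of the form `sample=<id> ... retrying` (the real framework's log layout may differ):

```python
import re
from collections import Counter

# Assumed log line shape for illustration: "sample=<id> ... retrying (n/m)".
RETRY_RE = re.compile(r"sample=(\S+).*retry", re.I)

def stalled_samples(log_lines, threshold=3):
    """Return sample ids that retried at least `threshold` times."""
    counts = Counter()
    for line in log_lines:
        match = RETRY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(s for s, n in counts.items() if n >= threshold)
```

Samples flagged this way are the natural starting point for the deeper checks above: their pod state distinguishes OOMKilled infrastructure failures from API-side rate limiting.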