Troubleshoots and resolves stuck or failing UK AISI Inspect AI evaluations on the Hawk platform.
The debug-stuck-eval skill provides a specialized diagnostic toolkit for researchers and developers using the UK AISI Inspect AI framework. It streamlines the process of identifying why evaluations hang or fail by analyzing pod states, log patterns, and sample completion status. The skill guides users through verifying authentication, checking retry loops, testing API connectivity through the Middleman proxy, and implementing recovery steps like S3 buffer-aware restarts to ensure evaluation continuity without data loss.
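The sample-completion check described above can be sketched as a small script. Everything here is illustrative: the skill's real interface to Inspect AI eval logs is not shown in this listing, so the status dictionary and state names below are assumptions.

```python
# Minimal sketch of sample-level progress inspection for a stuck eval.
# The log structure (a dict of sample id -> status) is a hypothetical
# stand-in for whatever the real Inspect AI eval log contains.

def find_stuck_samples(sample_statuses, done_states=("success", "error")):
    """Return ids of samples that have not reached a terminal state."""
    return [sid for sid, status in sample_statuses.items()
            if status not in done_states]

statuses = {
    "sample-001": "success",
    "sample-002": "running",   # still in flight -> candidate for "stuck"
    "sample-003": "retrying",  # repeated API retries, worth inspecting
    "sample-004": "error",
}
print(find_stuck_samples(statuses))  # ['sample-002', 'sample-003']
```

A check like this separates an evaluation that is genuinely frozen (no samples progressing) from one that is slowly grinding through retries.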
Key Features
1. Sample-level progress tracking to identify malformed responses
2. Detection of common error patterns including OOMKilled pods and API retries
3. Direct API connectivity testing via Middleman and provider endpoints
4. Step-by-step recovery workflows for restarting stuck evaluations
5. Automated status and log analysis for Hawk/Inspect evaluation sets
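The error-pattern detection in the feature list could look something like the following. The skill's actual pattern list is not published here, so these regexes are illustrative guesses at common failure signatures (OOMKilled pods, API retry loops, 500 responses):

```python
import re

# Hypothetical signatures for common failure modes; the real skill's
# pattern set may differ. Each regex is matched against raw pod log lines.
ERROR_PATTERNS = {
    "oom": re.compile(r"OOMKilled"),
    "api_retry": re.compile(r"Retrying request"),
    "server_error": re.compile(r"500 Internal Server Error"),
}

def scan_log(lines):
    """Count occurrences of each known error pattern in pod log lines."""
    counts = {name: 0 for name in ERROR_PATTERNS}
    for line in lines:
        for name, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

log = [
    "pod eval-worker-3 terminated: OOMKilled",
    "Retrying request (attempt 4/10) after 500 Internal Server Error",
    "sample-017 completed",
]
print(scan_log(log))  # {'oom': 1, 'api_retry': 1, 'server_error': 1}
```

Counting patterns rather than stopping at the first match helps distinguish a one-off transient error from a sustained retry loop.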
Use Cases
1. Investigating high retry counts and latency in long-running tasks
2. Diagnosing why an AI evaluation set is frozen or not progressing
3. Troubleshooting 500 Internal Server errors in LLM API requests
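For the first use case, flagging samples with high retry counts or latency reduces to a simple threshold scan. The record fields and threshold values here are illustrative assumptions, not part of the skill's documented interface:

```python
# Sketch of flagging problem samples in a long-running eval. A sample is
# suspect if it has retried more than max_retries times or exceeded
# max_latency_s seconds; both thresholds are made-up defaults.

def flag_problem_samples(records, max_retries=5, max_latency_s=120.0):
    """Return ids of samples whose retries or latency exceed thresholds."""
    return [r["id"] for r in records
            if r["retries"] > max_retries or r["latency_s"] > max_latency_s]

records = [
    {"id": "s1", "retries": 2, "latency_s": 14.0},
    {"id": "s2", "retries": 9, "latency_s": 30.0},   # retry storm
    {"id": "s3", "retries": 1, "latency_s": 600.0},  # hung request
]
print(flag_problem_samples(records))  # ['s2', 's3']
```

Samples flagged this way are the natural starting point for the deeper checks above: pull their pod logs, look for OOMKilled or 500-error patterns, and test API connectivity before restarting.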