Diagnoses and resolves stalled, hanging, or failing AI evaluations within the UK AISI Inspect framework.
The Inspect AI Evaluation Debugger is a specialized diagnostic tool for researchers and engineers running UK AISI Inspect evaluations in cloud environments. It provides a structured methodology for identifying why evaluations are hanging or failing, covering everything from authentication verification and pod log analysis to deep-dive API connectivity testing. By recognizing specific error patterns such as OOMKills, token-limit-exceeded errors, and Middleman proxy failures, this skill enables users to quickly recover stuck runs using S3 buffer resumes and targeted infrastructure fixes.
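To make the triage step concrete, here is a minimal sketch of the first diagnostic move: pulling an .eval log from the S3 buffer and checking whether the run finished, errored, or was never finalized. It assumes the `inspect-ai` Python package (with `s3fs` for S3 URIs) is installed; the bucket path is a hypothetical placeholder.

```python
# Minimal triage sketch: read the header of an .eval log from the S3
# buffer and report whether the run finished, errored, or is still open.
# Assumes the `inspect-ai` package (with s3fs for S3 URIs) is installed;
# the bucket path below is a hypothetical placeholder.
from inspect_ai.log import read_eval_log

LOG_URI = "s3://my-eval-buffer/logs/2024-06-01T12-00-00_task_abc123.eval"  # hypothetical

log = read_eval_log(LOG_URI, header_only=True)  # the header is enough for triage

print(f"task:   {log.eval.task}")
print(f"model:  {log.eval.model}")
print(f"status: {log.status}")  # "started" | "success" | "cancelled" | "error"

if log.status == "error" and log.error is not None:
    # The error message often names the failure class directly
    # (e.g. token limits exceeded, upstream 500s from the proxy).
    print(log.error.message)
elif log.status == "started":
    # A log stuck at "started" with no recent writes usually means the
    # pod died (e.g. OOMKilled) before the run could finalize the log.
    print("Run never finalized -- check pod status and consider a retry.")
```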
Key Features
1. Automated status and log analysis for Hawk/Inspect AI evaluations
2. Workflow guidance for S3 buffer access and .eval log extraction
3. Detection of high HTTP retry counts indicating API instability
4. Direct API connectivity testing through Middleman and model provider endpoints
5. Pattern matching for common errors such as OOMKills and Pod UID mismatches (see the sketch after this list)
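As an illustration of the pattern-matching and retry-detection features above, the following sketch scans captured pod log text (for example, the output of `kubectl logs <pod>`) for common failure signatures. The regexes and the retry threshold are illustrative assumptions, not the skill's authoritative pattern set.

```python
# Sketch of the error-pattern scan, run over text captured with
# `kubectl logs <pod>`. The regexes below are illustrative guesses at
# the relevant log lines, not an exhaustive or authoritative set.
import re
import sys
from collections import Counter

PATTERNS = {
    "oom_kill": re.compile(r"OOMKilled|Out of memory|oom-kill", re.IGNORECASE),
    "http_retry": re.compile(r"retrying.*(?:429|500|502|503)", re.IGNORECASE),
    "pod_uid_mismatch": re.compile(r"pod uid mismatch", re.IGNORECASE),
    "token_limit": re.compile(r"token limit|context length exceeded", re.IGNORECASE),
}

RETRY_ALERT_THRESHOLD = 20  # heuristic: more retries than this suggests API instability


def scan_pod_log(text: str) -> Counter:
    """Count occurrences of each known failure signature in a pod log."""
    hits = Counter()
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name] += 1
    return hits


if __name__ == "__main__":
    hits = scan_pod_log(sys.stdin.read())
    for name, count in hits.most_common():
        print(f"{name}: {count}")
    if hits["http_retry"] > RETRY_ALERT_THRESHOLD:
        print("High retry count -- check Middleman/provider connectivity.")
```

Piped from a log dump (`kubectl logs <pod> | python scan_log.py`), this gives a quick first read on whether a stall is memory-, API-, or scheduling-related.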
Use Cases
1. Diagnosing HTTP 500 Internal Server Error responses during large-scale model testing
2. Troubleshooting "stuck" or "frozen" evaluations that are not progressing
3. Recovering and resuming evaluations after pod memory exhaustion (see the sketch after this list)
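For the recovery case, here is a minimal sketch of an S3 buffer resume, assuming the `inspect-ai` package: `eval_retry` replays a failed run from its existing log rather than starting over. The log path is a hypothetical placeholder, and passing `max_connections` rests on the assumption that `eval_retry` accepts the same generation options as `eval()`.

```python
# Sketch of an S3 buffer resume after a pod OOMKill: retry the failed
# run from its existing .eval log so finished samples are not re-run.
# Assumes the `inspect-ai` package; the log path is a hypothetical
# placeholder, and max_connections is assumed to be accepted by
# eval_retry (mirroring eval()'s generation options).
from inspect_ai import eval_retry

LOG_URI = "s3://my-eval-buffer/logs/2024-06-01T12-00-00_task_abc123.eval"  # hypothetical

# Lower concurrency on the retry so the resumed pod stays within its
# (presumably increased) memory limit and puts less load on the API.
eval_retry(
    LOG_URI,
    max_connections=4,
)
```

The CLI form, `inspect eval-retry <log-file>`, takes the same log path and is often the quicker option when working directly from a terminal against the bucket.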