Diagnoses and resolves hanging or failing AI model evaluations using Hawk and the UK AISI Inspect framework.
The Inspect AI Evaluation Debugger is a specialized skill designed to troubleshoot stalled or failing model evaluation runs in the cloud. It provides a structured diagnostic workflow for the Hawk platform, enabling users to verify authentication, monitor evaluation status, and analyze logs for specific error patterns like rate limits, OOM errors, and API proxy failures. By facilitating direct API testing and providing instructions for buffer-based recovery, this skill helps developers ensure that long-running AI evaluations reach completion even when faced with infrastructure instability or provider-side errors.
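As a rough illustration of the log-analysis step, the sketch below scans an evaluation log for the failure signatures described above (rate limits, 500-class errors, timeouts, OOM kills). The log file name, the exact regex patterns, and the category labels are illustrative assumptions, not part of the Hawk or Inspect APIs.

```python
# Minimal sketch of log pattern matching, assuming a plain-text evaluation log.
# Patterns and the log path are illustrative, not Hawk/Inspect internals.
import re
from pathlib import Path

FAILURE_PATTERNS = {
    "rate_limit": re.compile(r"429|rate limit", re.IGNORECASE),
    "server_error": re.compile(r"500 Internal Server Error|502|503", re.IGNORECASE),
    "timeout": re.compile(r"timed? ?out|ReadTimeout", re.IGNORECASE),
    "oom": re.compile(r"OOMKilled|out of memory", re.IGNORECASE),
}

def classify_failures(log_path: str) -> dict[str, int]:
    """Count occurrences of each known failure signature in the log."""
    counts = {name: 0 for name in FAILURE_PATTERNS}
    for line in Path(log_path).read_text(errors="ignore").splitlines():
        for name, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

if __name__ == "__main__":
    print(classify_failures("eval-run.log"))  # hypothetical log file name
```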
Key Features
1. Automated log pattern matching for 500 errors and timeout detection
2. Real-time evaluation status monitoring and pod state analysis
3. Memory exhaustion (OOMKilled) diagnosis and resolution guidance
4. Evaluation resumption and recovery management using S3 buffer synchronization (see the sketch after this list)
5. Direct API proxy testing via Middleman to isolate connectivity issues
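For the buffer-based recovery feature, a minimal sketch of the S3 synchronization step is shown below: it pulls an evaluation buffer down from S3 so an interrupted run can be inspected or resumed locally. The bucket name, prefix, and destination directory are illustrative assumptions; the actual Hawk buffer layout and resume command may differ.

```python
# Rough sketch of buffer-based recovery: download every object under the
# evaluation buffer prefix to a local directory. Bucket, prefix, and
# destination are placeholders, not the real Hawk buffer layout.
from pathlib import Path
import boto3

def sync_eval_buffer(bucket: str, prefix: str, dest: str) -> None:
    """Mirror the S3 evaluation buffer into a local directory."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            target = Path(dest) / Path(obj["Key"]).relative_to(prefix)
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(bucket, obj["Key"], str(target))

if __name__ == "__main__":
    # Hypothetical bucket and run prefix for a stalled evaluation.
    sync_eval_buffer("my-eval-bucket", "eval-buffers/run-1234/", "./recovered-buffer")
```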
Use Cases
1. Troubleshooting evaluations that are stuck in retry loops or hanging indefinitely
2. Diagnosing 500 Internal Server Error responses during large-scale model testing runs
3. Verifying whether evaluation failures are caused by proxy issues or provider rate limits (a minimal probe is sketched below)
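To help separate proxy-side failures from provider rate limits, a single small request against the proxy can be informative. The sketch below assumes the Middleman proxy exposes an OpenAI-compatible chat completions endpoint; the base URL, model name, and API key environment variables are placeholders for your deployment, not documented Middleman settings.

```python
# Minimal connectivity probe, assuming an OpenAI-compatible
# /chat/completions endpoint behind the proxy. All names are placeholders.
import os
import requests

PROXY_BASE_URL = os.environ.get("PROXY_BASE_URL", "https://middleman.example.internal/v1")

def probe_proxy(model: str = "gpt-4o-mini") -> None:
    """Send one tiny request and report whether the failure looks proxy- or provider-side."""
    resp = requests.post(
        f"{PROXY_BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['PROXY_API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": "ping"}], "max_tokens": 1},
        timeout=30,
    )
    if resp.status_code == 429:
        print("Provider rate limit: back off and retry later.")
    elif resp.status_code >= 500:
        print(f"Proxy or upstream failure ({resp.status_code}): check the proxy's health.")
    else:
        resp.raise_for_status()
        print("Proxy reachable; the evaluation failure is likely elsewhere.")

if __name__ == "__main__":
    probe_proxy()
```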