Diagnoses and resolves stalled, hanging, or failing AI evaluations within the Hawk and Inspect AI frameworks.
The Inspect AI Evaluation Debugger is a specialized diagnostic skill designed to troubleshoot hung or failing model evaluations on the METR Hawk platform. It provides a structured workflow to verify authentication, monitor pod states, analyze logs for error patterns like 'OOMKilled' or 'RateLimitError,' and perform direct API connectivity tests through middleman proxies. By offering specific recovery steps and techniques for accessing S3-backed sample buffers, this skill helps developers and researchers minimize downtime and ensure the successful completion of complex AI benchmarking tasks.
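As a rough illustration of the log-analysis step, the sketch below scans a captured pod log for the failure signatures mentioned above. The log path and the signature list are assumptions for the example, not part of the skill itself.

```python
import re
from collections import Counter
from pathlib import Path

# Illustrative failure signatures; the skill's own pattern set may differ.
SIGNATURES = {
    "OOMKilled": r"OOMKilled",
    "RateLimitError": r"RateLimitError",
    "HTTP 500": r"500 Internal Server Error",
    "Token limit": r"maximum context length|token limit",
}

def summarize_failures(log_path: str) -> Counter:
    """Count occurrences of each known failure signature in a pod log dump."""
    counts = Counter()
    text = Path(log_path).read_text(errors="replace")
    for name, pattern in SIGNATURES.items():
        counts[name] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

if __name__ == "__main__":
    # Hypothetical log file, e.g. captured with `kubectl logs <pod> > eval-pod.log`.
    print(summarize_failures("eval-pod.log"))
```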
Key Features
1. Resource utilization tracking to detect memory exhaustion and OOM kills
2. Recovery workflows for restarting stuck evaluations using S3-backed sample buffers (see the sketch after this list)
3. Real-time evaluation status reporting and pod state monitoring
4. Direct API connectivity testing via Middleman and provider-specific curl commands
5. Automated log analysis to identify API retry patterns and error codes
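For the recovery workflow, the buffered samples live in S3. The sketch below lists and downloads buffer objects with boto3; the bucket name and key prefix are hypothetical placeholders, since the real buffer location depends on how the Hawk deployment is configured.

```python
import boto3

# Hypothetical bucket and prefix; the actual sample-buffer location is
# deployment-specific.
BUCKET = "hawk-eval-sample-buffers"
PREFIX = "runs/eval-123/"

def list_buffered_samples(bucket: str = BUCKET, prefix: str = PREFIX) -> list[str]:
    """List sample-buffer objects so a stuck evaluation can be resumed from them."""
    s3 = boto3.client("s3")
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

def download_sample(key: str, dest: str, bucket: str = BUCKET) -> None:
    """Copy one buffered sample locally for inspection before restarting the run."""
    boto3.client("s3").download_file(bucket, key, dest)
```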
Use Cases
1. Troubleshooting evaluations that are frozen or stuck in an infinite retry loop
2. Diagnosing HTTP 500 Internal Server Error responses or token-limit exceptions during high-volume runs
3. Validating connectivity and authentication between model providers and the Hawk environment (see the sketch below)
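A minimal connectivity and authentication check, sketched in Python rather than the skill's provider-specific curl commands. The proxy URL, model endpoint, and API-key environment variable are placeholders; the real values come from the Hawk environment.

```python
import os
import requests

# Placeholder endpoints; the actual middleman proxy URL and provider endpoint
# are supplied by the deployment.
PROXY_URL = "http://middleman.internal:8080"
MODEL_ENDPOINT = "https://api.openai.com/v1/models"

def check_connectivity(timeout: float = 10.0) -> None:
    """Send a minimal authenticated request through the proxy and report the result."""
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    proxies = {"http": PROXY_URL, "https": PROXY_URL}
    resp = requests.get(MODEL_ENDPOINT, headers=headers, proxies=proxies, timeout=timeout)
    if resp.status_code == 200:
        print("Provider reachable; authentication accepted.")
    elif resp.status_code in (401, 403):
        print("Reached the provider, but the API key was rejected.")
    else:
        print(f"Unexpected status {resp.status_code}: {resp.text[:200]}")

if __name__ == "__main__":
    check_connectivity()
```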