Diagnoses and resolves hanging or failing AI model evaluations running on the Hawk and Inspect frameworks.
This skill provides a comprehensive toolkit for troubleshooting stuck AI evaluations, specifically targeting the Hawk and UK AISI Inspect AI environments. It lets developers run deep-dive diagnostics by verifying authentication, monitoring pod states, and analyzing logs for specific failure signatures such as OOMKilled terminations, rate limits, or malformed API responses. By guiding users through direct API connectivity testing via the Middleman proxy and managing S3-backed sample buffers, the skill ensures that complex evaluation sets can be recovered or resumed without data loss when a pipeline stalls.
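As one concrete illustration of the pod-state and log checks described above, the sketch below shells out to kubectl to flag OOMKilled containers and scan recent pod logs for rate-limit and server-error signatures. The namespace, label selector, and signature strings are assumptions for illustration, not the skill's actual configuration.

```python
# Minimal sketch: flag OOMKilled pods and scan logs for common failure
# signatures. NAMESPACE, SELECTOR, and SIGNATURES are assumptions, not
# the skill's real configuration.
import json
import subprocess

NAMESPACE = "evals"            # hypothetical namespace
SELECTOR = "app=inspect-eval"  # hypothetical label selector
SIGNATURES = ("429", "rate limit", "500 Internal Server Error")

def report_pod_failures() -> None:
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE, "-l", SELECTOR, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pod in json.loads(out)["items"]:
        name = pod["metadata"]["name"]
        # OOMKilled shows up as the termination reason of a container's last state.
        for status in pod["status"].get("containerStatuses", []):
            terminated = status.get("lastState", {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                print(f"{name}: container {status['name']} was OOMKilled")
        # Scan the tail of the pod's logs for known failure signatures.
        logs = subprocess.run(
            ["kubectl", "logs", name, "-n", NAMESPACE, "--tail=500"],
            capture_output=True, text=True,
        ).stdout
        for sig in SIGNATURES:
            if sig in logs:
                print(f"{name}: found signature {sig!r} in recent logs")

if __name__ == "__main__":
    report_pod_failures()
```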
Key Features
01 Automated identification of 400/500 API errors and resource exhaustion
02 Direct API connectivity testing through Middleman and provider proxies (see the probe sketch after this list)
03 Sample-level tracking and S3 buffer management for evaluation recovery
04 Real-time log streaming and historical analysis for error pattern detection
05 Comprehensive JSON status reporting for evaluation sets and pod health
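A minimal sketch of the connectivity probe named in feature 02 might look like the following. The Middleman base URL, environment variable names, and OpenAI-compatible route are assumptions; the real proxy interface may differ.

```python
# Minimal connectivity probe through a proxy such as Middleman. The base
# URL, env vars, route, and auth scheme below are assumptions for
# illustration only.
import os
import requests

MIDDLEMAN_URL = os.environ.get("MIDDLEMAN_URL", "http://middleman.internal")  # hypothetical
API_KEY = os.environ["MIDDLEMAN_API_KEY"]  # hypothetical env var

def probe(model: str = "gpt-4o") -> None:
    resp = requests.post(
        f"{MIDDLEMAN_URL}/v1/chat/completions",  # assumed OpenAI-compatible route
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": "ping"}],
            "max_tokens": 1,
        },
        timeout=30,
    )
    if resp.status_code == 429:
        print("Rate limited; Retry-After:", resp.headers.get("Retry-After"))
    elif resp.status_code >= 500:
        print("Proxy/provider error:", resp.status_code, resp.text[:200])
    else:
        resp.raise_for_status()
        print("Connectivity OK in", resp.elapsed.total_seconds(), "s")

if __name__ == "__main__":
    probe()
```

Sending a one-token request keeps the probe cheap while still exercising authentication, routing, and rate limiting end to end.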
Use Cases
01 Troubleshooting an evaluation that has stopped progressing or is throwing persistent 500 errors
02 Verifying whether an evaluation delay is caused by rate limits or internal proxy connectivity issues
03 Resuming a failed or stuck evaluation from a checkpoint without losing previous sample progress (see the recovery sketch below)
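To illustrate use case 03, the sketch below assumes completed samples are persisted as individual objects in an S3-backed buffer, so a resumed run can skip them and process only the remainder. The bucket name, key layout, and sample-ID scheme are all hypothetical.

```python
# Minimal sketch of checkpoint recovery from an S3-backed sample buffer.
# Bucket name, key layout, and sample-ID scheme are assumptions.
import boto3

BUCKET = "eval-sample-buffer"  # hypothetical bucket
PREFIX = "run-1234/samples/"   # hypothetical key layout

def completed_samples() -> set[str]:
    """Collect the IDs of samples already written to the buffer."""
    s3 = boto3.client("s3")
    done: set[str] = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            # Assume keys look like run-1234/samples/<sample_id>.json
            done.add(obj["Key"].removeprefix(PREFIX).removesuffix(".json"))
    return done

def remaining(all_ids: list[str]) -> list[str]:
    """Return the sample IDs a resumed run still needs to execute."""
    done = completed_samples()
    return [sid for sid in all_ids if sid not in done]

if __name__ == "__main__":
    print("Samples still to run:", remaining([f"sample-{i}" for i in range(10)]))
```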