Diagnoses and resolves stalled or failing AI model evaluations running on the UK AISI Inspect framework via METR's Hawk CLI.
This skill provides specialized commands and troubleshooting patterns for managing UK AISI Inspect evaluations via the Hawk CLI. It enables developers to monitor evaluation progress, inspect pod logs, identify API bottlenecks through Middleman proxy testing, and recover stalled samples using S3 buffers. It is particularly useful when evaluations hang, hit persistent 500 errors, or run into rate limits, offering clear resolution paths for complex distributed evaluation environments.
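When an evaluation stalls on provider errors, the fastest check is often a single request sent straight through the proxy. Below is a minimal connectivity probe in Python; the proxy URL, API-key environment variable, and payload shape are assumptions for illustration, not the real Middleman interface:

```python
"""Minimal probe for the model-API path through an auth proxy.

Sketch only: MIDDLEMAN_URL and MIDDLEMAN_API_KEY are hypothetical
names, and the payload mimics a generic chat-completions schema.
"""
import os

import requests

PROXY_URL = os.environ.get(
    "MIDDLEMAN_URL",  # hypothetical env var
    "https://middleman.internal.example/v1/chat/completions",
)
API_KEY = os.environ["MIDDLEMAN_API_KEY"]  # hypothetical env var

resp = requests.post(
    PROXY_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "test-model",  # placeholder model name
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    },
    timeout=30,
)

# 401/403 -> auth problem at the proxy; 429 -> rate limit;
# 5xx -> proxy or provider outage; 200 -> this leg is healthy.
print(resp.status_code, resp.text[:200])
```

Running the same probe against the provider's public endpoint helps separate proxy faults from provider faults.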
Key Features
1. Automated diagnostic checklists for OOMKilled pods and context limit errors (see the pod-status sketch after this list)
2. Detailed log analysis to identify specific API error patterns and retry loops (see the log-scan sketch after this list)
3. Direct API testing for Middleman auth proxies and model providers
4. Real-time status monitoring and pod metric tracking via the Hawk CLI
5. Buffer management using S3 to resume interrupted evaluation sets
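For the OOMKilled checklist in item 1, here is a hedged sketch of the core check: reading a pod's last termination state from kubectl's JSON output. The pod name and namespace are placeholders, and a kubectl context pointed at the evaluation cluster is assumed:

```python
"""Check whether an evaluation pod's container was OOMKilled.

Sketch only: POD and NAMESPACE are placeholders; requires kubectl
configured against the evaluation cluster.
"""
import json
import subprocess

POD = "inspect-eval-runner-0"  # hypothetical pod name
NAMESPACE = "inspect-evals"    # hypothetical namespace

raw = subprocess.run(
    ["kubectl", "get", "pod", POD, "-n", NAMESPACE, "-o", "json"],
    check=True, capture_output=True, text=True,
).stdout
pod = json.loads(raw)

for cs in pod["status"].get("containerStatuses", []):
    terminated = cs.get("lastState", {}).get("terminated") or {}
    if terminated.get("reason") == "OOMKilled":
        print(
            f"{cs['name']}: OOMKilled at {terminated.get('finishedAt')} "
            f"(exit code {terminated.get('exitCode')}) -- raise the memory limit."
        )
    else:
        print(f"{cs['name']}: no OOMKill in last termination state.")
```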
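For the log analysis in item 2, a minimal sketch that tallies common error signatures in a captured log file (for example, one saved with `kubectl logs`); the path and regex patterns are assumptions about typical provider errors, not Inspect's exact log format:

```python
"""Tally API error patterns and retry loops in an evaluation log.

Sketch only: LOG_PATH and the regexes are assumptions about typical
provider-error messages, not Inspect's exact output.
"""
import re
from collections import Counter

LOG_PATH = "eval.log"  # e.g. saved via `kubectl logs <pod> > eval.log`

PATTERNS = {
    "rate_limit": re.compile(r"\b429\b|rate.?limit", re.I),
    "server_error": re.compile(r"\b(500|502|503|504)\b|internal server error", re.I),
    "retry": re.compile(r"\bretry(ing)?\b", re.I),
    "context_limit": re.compile(r"context.?(length|window|limit)", re.I),
}

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1

# Many retries alongside 429s points at rate limiting; retries with
# 5xx responses point at a flaky provider; context_limit hits suggest
# oversized prompts rather than infrastructure trouble.
for name, n in counts.most_common():
    print(f"{name}: {n}")
```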
Use Cases
1. Verifying authentication and connectivity between the evaluation runner and model providers
2. Recovering a failed evaluation from a checkpoint without losing previous sample progress (see the buffer sketch after this list)
3. Troubleshooting an AI evaluation set that has stopped progressing or is throwing persistent 500 errors
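For the checkpoint recovery in use case 2, a sketch of how completed samples might be enumerated from an S3 buffer before resuming, so only unfinished work is re-run. The bucket name, key prefix, and one-object-per-sample layout are hypothetical, not the skill's real schema:

```python
"""List buffered (completed) samples in S3 and compute the remainder.

Sketch only: bucket, prefix, key layout, and sample IDs are
hypothetical placeholders for whatever the evaluation set defines.
"""
import boto3

BUCKET = "inspect-eval-buffers"                        # hypothetical bucket
PREFIX = "runs/run-1234/samples/"                      # hypothetical layout
ALL_SAMPLES = {f"sample-{i:04d}" for i in range(500)}  # hypothetical IDs

s3 = boto3.client("s3")
done = set()
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # e.g. runs/run-1234/samples/sample-0042.json -> sample-0042
        key = obj["Key"].rsplit("/", 1)[-1]
        done.add(key.removesuffix(".json"))

remaining = sorted(ALL_SAMPLES - done)
print(f"{len(done)} samples buffered; {len(remaining)} left to resume.")
```

Resuming with only the remaining IDs preserves prior sample progress instead of re-running the whole set.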