01 Standardized AILANG benchmark creation and YAML validation
02 Automated scripts for testing benchmarks against specific Claude models
03 Advanced prompt debugging to visualize exactly what models see during evaluation
04 Comprehensive capability testing for I/O, file systems, and networking
05 Built-in checks for the critical 'task_prompt' vs 'prompt' configuration
06 20 GitHub stars
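
The 'task_prompt' vs 'prompt' check listed above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the function name `check_prompt_key` and its parameters are hypothetical, the benchmark is assumed to be already parsed from YAML into a plain mapping, and the source does not state which of the two keys the harness expects, so the expected key is passed in by the caller.

```python
from __future__ import annotations

from typing import Any, Mapping


def check_prompt_key(
    spec: Mapping[str, Any],
    expected_key: str,
    confusable_key: str,
) -> list[str]:
    """Return human-readable problems; an empty list means the check passed.

    Hypothetical helper: `spec` is a benchmark definition already loaded
    from YAML, `expected_key` is the key the harness is assumed to read,
    and `confusable_key` is the look-alike key that is easy to write by mistake.
    """
    problems: list[str] = []
    has_expected = expected_key in spec
    has_confusable = confusable_key in spec

    if has_expected and has_confusable:
        # Both keys present: ambiguous, so flag it rather than guess.
        problems.append(
            f"both '{expected_key}' and '{confusable_key}' are set; "
            f"keep only '{expected_key}'"
        )
    elif has_confusable:
        # Only the look-alike key is present.
        problems.append(
            f"found '{confusable_key}' but expected '{expected_key}'"
        )
    elif not has_expected:
        problems.append(f"missing required key '{expected_key}'")
    elif not isinstance(spec[expected_key], str) or not spec[expected_key].strip():
        # Key is present but holds no usable text.
        problems.append(f"'{expected_key}' must be a non-empty string")

    return problems


if __name__ == "__main__":
    # Assumption for illustration only: 'task_prompt' is treated as the expected key.
    benchmark = {"id": "example", "prompt": "Write a function that ..."}
    for problem in check_prompt_key(benchmark, "task_prompt", "prompt"):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a validation script report every issue in a benchmark file in a single pass.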