Regression gating that auto-commits net improvements and reverts performance drops
38 GitHub stars
Continuous EVAL-ANALYZE-RESEARCH-IMPROVE-DECIDE loop for automated agent evolution
Multi-SDK benchmarking support for Mini, Claude, Copilot, and Microsoft implementations
L1-L12 progressive test suite for measuring complex reasoning and task execution
Automated failure taxonomy mapping to identify specific code or prompt weaknesses
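
The regression gate named in the list above (the DECIDE step of the loop) could be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the `gate` function, the `GateDecision` type, the per-level score dictionaries, and the `tolerance` threshold are all hypothetical names and assumptions. In particular, treating "net improvement" as a positive mean score change with no single level dropping beyond a tolerance is only one plausible policy.

```python
from dataclasses import dataclass

# L1..L12, following the suite naming in the feature list.
LEVELS = [f"L{i}" for i in range(1, 13)]


@dataclass(frozen=True)
class GateDecision:
    action: str        # "commit" or "revert"
    net_delta: float   # mean score change across all levels
    regressed: list    # levels that dropped by more than the tolerance


def gate(baseline: dict, candidate: dict, tolerance: float = 0.02) -> GateDecision:
    """Compare two benchmark runs and decide whether to keep the change.

    Assumption: each dict maps a level name ("L1".."L12") to a score in [0, 1].
    Assumed policy: commit only if the mean change is positive AND no level
    drops by more than `tolerance`; otherwise revert.
    """
    deltas = {lvl: candidate[lvl] - baseline[lvl] for lvl in LEVELS}
    net_delta = sum(deltas.values()) / len(deltas)
    regressed = [lvl for lvl, d in deltas.items() if d < -tolerance]
    action = "commit" if net_delta > 0 and not regressed else "revert"
    return GateDecision(action, net_delta, regressed)


# Illustrative data only: a flat baseline and two candidate runs.
baseline = {lvl: 0.5 for lvl in LEVELS}
improved = {**baseline, "L3": 0.7}             # one level improves
mixed = {**baseline, "L3": 0.7, "L5": 0.4}     # net gain, but L5 drops

print(gate(baseline, improved).action)
print(gate(baseline, mixed).action, gate(baseline, mixed).regressed)
```

In a real pipeline the returned action would presumably drive a version-control commit or revert; that wiring is left out here because the source does not describe it.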