01. Real-time verbose logging for visibility into agent reasoning and actions
02. Automated execution of SWE-bench Lite tasks with sensible defaults
03. Generation of a comprehensive results.json and a human-readable report.md
04. Support for configurable sample sizes and specific task IDs
05. Native integration with Model Context Protocol (MCP) server testing
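To make the task-selection and reporting features above concrete, here is a minimal Python sketch. It is not the tool's actual implementation: the function names `select_tasks` and `write_reports`, the `seed` parameter, and the `task_id`/`resolved` result fields are all assumptions made for illustration. Only the output file names results.json and report.md come from the list above.

```python
import json
import random
from pathlib import Path


def select_tasks(all_ids, sample_size=None, task_ids=None, seed=0):
    """Hypothetical helper: explicit task IDs win, otherwise draw a seeded sample."""
    if task_ids:
        known = set(all_ids)
        missing = [t for t in task_ids if t not in known]
        if missing:
            raise ValueError(f"unknown task IDs: {missing}")
        return list(task_ids)
    if sample_size is None or sample_size >= len(all_ids):
        return list(all_ids)
    # A fixed seed keeps the sample reproducible between runs.
    return random.Random(seed).sample(list(all_ids), sample_size)


def write_reports(results, out_dir):
    """Hypothetical helper: write results.json and report.md into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Machine-readable output.
    json_path = out / "results.json"
    json_path.write_text(json.dumps(results, indent=2))

    # Human-readable summary with one table row per task.
    resolved = sum(1 for r in results if r["resolved"])
    lines = [
        "# Benchmark report",
        "",
        f"Resolved {resolved} of {len(results)} tasks.",
        "",
        "| Task ID | Resolved |",
        "| --- | --- |",
    ]
    lines += [
        f"| {r['task_id']} | {'yes' if r['resolved'] else 'no'} |" for r in results
    ]
    md_path = out / "report.md"
    md_path.write_text("\n".join(lines) + "\n")
    return json_path, md_path
```

In this sketch an explicit list of task IDs takes precedence over the sample size, so a single failing task can be re-run without changing the sampling settings.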