Detailed logging and cost/runtime estimation
Comprehensive results reporting in JSON and Markdown formats
Configurable sample sizing for quick tests or full evaluations
Baseline comparison to track MCP server improvements
Automated SWE-bench Lite evaluation runner
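
As a rough illustration of the baseline-comparison and reporting features listed above, the following is a minimal sketch of how two evaluation runs might be compared and emitted as both JSON and Markdown. It is not the project's actual code: the function names (`compare_to_baseline`, `to_markdown`), the `resolved`/`total` keys, and the sample numbers are all hypothetical and chosen only for the example.

```python
import json


def compare_to_baseline(baseline: dict, current: dict) -> dict:
    """Return the resolved rate of each run and the difference between them.

    Both inputs are hypothetical run summaries of the form
    {"resolved": <int>, "total": <int>}.
    """
    def rate(run: dict) -> float:
        total = run["total"]
        # Guard against an empty run to avoid division by zero.
        return run["resolved"] / total if total else 0.0

    baseline_rate = rate(baseline)
    current_rate = rate(current)
    return {
        "baseline_rate": baseline_rate,
        "current_rate": current_rate,
        "delta": current_rate - baseline_rate,
    }


def to_markdown(summary: dict) -> str:
    """Render the summary as a two-column Markdown table of percentages."""
    rows = ["| Metric | Value |", "| --- | --- |"]
    for key, value in summary.items():
        rows.append(f"| {key} | {value:.1%} |")
    return "\n".join(rows)


# Made-up sample numbers: a 20-task sample, before and after a change.
summary = compare_to_baseline(
    {"resolved": 5, "total": 20},
    {"resolved": 10, "total": 20},
)
print(json.dumps(summary, indent=2))  # JSON report
print(to_markdown(summary))           # Markdown report
```

Keeping the comparison as a plain dictionary means the same data can feed either output format without duplication; a real runner would likely add per-task detail, cost, and runtime fields.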