What is SWE-bench Lite?

SWE-bench Lite is a curated subset of 300 real-world GitHub issues from popular open-source Python repositories, used as a gold standard for evaluating AI software engineering agents.

What files are generated after a run?

The skill generates a results.json file containing raw metrics and token usage, and a report.md file providing a human-readable summary of the resolution rate.
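As a rough illustration of working with the output, the sketch below computes a resolution rate and total token count from a results file. The schema it reads (a top-level "tasks" list whose entries carry "resolved" and "tokens") is a hypothetical stand-in, not the documented layout of results.json; check the file your run actually produces and adjust the keys.

```python
import json
from pathlib import Path


def summarize(results: dict) -> dict:
    """Summarize a results dict.

    ASSUMPTION: the schema {"tasks": [{"resolved": bool, "tokens": int}, ...]}
    is hypothetical and may not match the real results.json.
    """
    tasks = results.get("tasks", [])
    resolved = sum(1 for task in tasks if task.get("resolved"))
    total_tokens = sum(task.get("tokens", 0) for task in tasks)
    # Avoid dividing by zero when the file contains no tasks.
    rate = resolved / len(tasks) if tasks else 0.0
    return {
        "total": len(tasks),
        "resolved": resolved,
        "resolution_rate": rate,
        "total_tokens": total_tokens,
    }


def load_and_summarize(path: str = "results.json") -> dict:
    """Read a results file from disk and summarize it."""
    return summarize(json.loads(Path(path).read_text(encoding="utf-8")))
```

Keeping the summary logic separate from file loading makes it easy to try against a hand-written dict before pointing it at a real run.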

How long does a benchmark run take?

A small sample of 5 tasks typically takes 15 to 30 minutes, while a full evaluation of the entire dataset can take several hours depending on your hardware and API limits.
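For planning purposes, the 5-task figure above can be extrapolated to other sample sizes. This is only a back-of-the-envelope sketch: it assumes per-task time scales linearly and that tasks can run in parallel batches, and the workers parameter is a hypothetical knob for the illustration, not a confirmed mcpbr option.

```python
import math


def estimate_minutes(n_tasks, sample_tasks=5, sample_minutes=(15, 30), workers=1):
    """Return a (low, high) runtime estimate in minutes.

    ASSUMPTIONS: linear scaling from the quoted 5-task range, and a
    hypothetical `workers` count of tasks running concurrently.
    """
    low_per_task = sample_minutes[0] / sample_tasks
    high_per_task = sample_minutes[1] / sample_tasks
    # Each batch runs `workers` tasks at once and takes one task's duration.
    batches = math.ceil(n_tasks / workers)
    return (batches * low_per_task, batches * high_per_task)


low, high = estimate_minutes(300, workers=4)
print(f"Estimated: {low / 60:.2f} to {high / 60:.2f} hours")
```

Real runs will deviate from this, since task difficulty, hardware, and API rate limits all vary.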

Do I need Docker to use this skill?

Yes. Docker must be running, because the benchmark runner executes each task in an isolated container for security and a consistent environment.
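A quick pre-flight check can save a failed run. The sketch below shells out to docker info, which exits with a non-zero status when the daemon is unreachable. The helper name docker_is_running is made up for this example and is not part of the skill.

```python
import subprocess


def docker_is_running(cmd=("docker", "info"), timeout=10):
    """Return True if the command exits with status 0.

    `docker_is_running` is a hypothetical helper for illustration only.
    """
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or the daemon did not answer in time.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("Docker available:", docker_is_running())
```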

Can I test my own custom MCP server with this?

Absolutely. This skill is specifically designed to help you evaluate how much your custom MCP tools improve an agent's performance on the benchmark compared to a baseline.
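One simple way to express that comparison is the difference in resolution rate between a baseline run and a run with your MCP server enabled, over the same set of tasks. The function below is a minimal sketch of that arithmetic. Its name and inputs are illustrative assumptions, and it does not read anything from the skill's own output.

```python
def compare_runs(baseline_resolved: int, mcp_resolved: int, total: int) -> dict:
    """Compare two runs over the same `total` tasks.

    Hypothetical helper: the counts are supplied by the caller.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of tasks")
    baseline_rate = baseline_resolved / total
    mcp_rate = mcp_resolved / total
    return {
        "baseline_rate": baseline_rate,
        "mcp_rate": mcp_rate,
        # Positive values mean the MCP run resolved more tasks.
        "absolute_gain": mcp_rate - baseline_rate,
    }
```

With small samples the gain is noisy, so a difference of one or two tasks on a 5-task run says little on its own.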

SWE-bench Lite Evaluator

Name: SWE-bench Lite Evaluator
Author: supermodeltools

Data Science & ML

Streamlines the execution of SWE-bench Lite evaluations to measure AI agent performance on real-world software engineering tasks.

The SWE-bench Lite Evaluator skill provides a standardized way to run the Model Context Protocol Benchmark Runner (mcpbr) against the SWE-bench Lite dataset. It automates the complex setup required for software engineering benchmarks, allowing developers to quickly test how effectively their AI models or MCP servers can resolve GitHub issues. With built-in support for Docker execution, automated reporting, and configurable task sampling, it serves as an essential tool for developers building and refining autonomous coding agents.

Key Features

01 Real-time verbose logging for visibility into agent reasoning and actions

02 Automated execution of SWE-bench Lite tasks with sensible defaults

03 Generates comprehensive results.json and human-readable report.md files

04 Support for configurable sample sizes and specific task IDs

05 Native integration with Model Context Protocol (MCP) server testing

06 8 GitHub stars

Use Cases

01 Testing for regressions in coding capabilities after updating agent prompts

02 Comparing performance metrics between different LLMs or MCP toolsets

03 Benchmarking an AI agent's ability to fix real Python library bugs

Install with 🐟 Skill.Fish

npx skillfish add supermodeltools/mcpbr benchmark-swe-lite