How much does it cost to run an evaluation?

Running a sample of 5 tasks typically costs between $2 and $5 in API fees, depending on task complexity and the model used.
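As a rough planning aid, the quoted sample cost can be scaled to other sample sizes. The sketch below assumes cost grows linearly with the number of tasks, which is a simplification: real costs vary with task complexity and model. The function name estimate_cost_range is hypothetical and not part of mcpbr.

```python
def estimate_cost_range(num_tasks, sample_size=5, sample_low=2.0, sample_high=5.0):
    """Scale the quoted sample cost range ($2 to $5 for 5 tasks) to num_tasks.

    Assumption: cost is proportional to task count. Treat the result as a
    ballpark figure only.
    """
    if num_tasks < 0 or sample_size <= 0:
        raise ValueError("num_tasks must be >= 0 and sample_size must be > 0")
    low = sample_low * num_tasks / sample_size
    high = sample_high * num_tasks / sample_size
    return low, high


# Ballpark for the full 300-task SWE-bench Lite set under the linear assumption.
low, high = estimate_cost_range(300)
print(f"Estimated cost for 300 tasks: ${low:.0f} to ${high:.0f}")
```

Under this assumption the per-task cost is roughly $0.40 to $1.00, so starting with a small sample is a sensible way to check spend before committing to a full run.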

What is the benefit of using this skill for MCP servers?

It provides a standardized way to test whether your Model Context Protocol (MCP) server helps an AI agent solve complex coding problems better than a baseline configuration.

What is SWE-bench Lite?

SWE-bench Lite is a curated benchmark consisting of 300 software engineering tasks used to evaluate the capabilities of AI models in solving real-world GitHub issues.

Do I need Docker to run this skill?

Yes. Docker is required to create the isolated environments in which benchmark tasks run, which keeps execution consistent and secure.

Where are the benchmark results saved?

By default, the skill saves full evaluation data to results.json and a human-readable summary to report.md in your current directory.
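The layout of results.json is not described here, so a cautious first step is to inspect its top-level structure before writing any analysis code. The sketch below assumes only that the file is valid JSON; the helper name top_level_summary is hypothetical.

```python
import json
from pathlib import Path


def top_level_summary(path):
    """Return a mapping of top-level keys to the type name of each value.

    No particular schema is assumed; a non-object root is reported
    under the placeholder key "<root>".
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return {key: type(value).__name__ for key, value in data.items()}
    return {"<root>": type(data).__name__}


results_path = Path("results.json")
if results_path.exists():
    for key, kind in top_level_summary(results_path).items():
        print(f"{key}: {kind}")
else:
    print("results.json not found; run an evaluation first")
```

Once the actual keys are known, the same pattern can be extended to pull out whichever fields the file provides.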

SWE-bench Lite Benchmark Runner

Name: SWE-bench Lite Benchmark Runner
Author: greynewell

Data Science and ML

Evaluates MCP servers using the SWE-bench Lite dataset to measure software engineering performance and accuracy.

This skill provides a streamlined interface for running SWE-bench Lite evaluations via the Model Context Protocol Benchmark Runner (mcpbr). It automates the benchmarking of MCP servers against real-world software engineering tasks, with pre-configured defaults for sample size, reporting, and verbosity. It is aimed at developers building or refining AI agents and MCP servers who need quantifiable performance metrics, baseline comparisons, and detailed diagnostic logs to improve their agents' problem-solving capabilities.

Key Features

01 Detailed logging and cost/runtime estimation

02 Comprehensive results reporting in JSON and Markdown formats

03 Configurable sample sizing for quick tests or full evaluations

04 Baseline comparison to track MCP server improvements

05 20 GitHub stars

06 Automated SWE-bench Lite evaluation runner

Use Cases

01 Measuring the effectiveness of new MCP servers on real GitHub issues

02 Detecting performance regressions during AI agent development

03 Generating standardized benchmark reports for Model Context Protocol tools

Install with 🐟 Skill.Fish

npx skillfish add greynewell/mcpbr benchmark-swe-lite

For use in Claude.ai and ChatGPT