About
This skill provides a streamlined interface for running SWE-bench Lite evaluations via the Model Context Protocol Benchmark Runner (mcpbr). It automates benchmarking MCP servers against real-world software engineering tasks and ships with sensible defaults for sample size, reporting, and verbosity. Developers building or refining AI agents and MCP servers can use it to obtain quantifiable performance metrics, baseline comparisons, and detailed diagnostic logs for improving their agents' problem-solving capabilities.
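As a rough illustration, an invocation might look like the sketch below. The subcommand and flag names (`--benchmark`, `--sample-size`, `--report`, `--verbose`) are assumptions made for this example, not confirmed mcpbr options; consult the mcpbr documentation for the actual interface.

```bash
# Hypothetical invocation sketch: the subcommand and flag names below are
# assumptions for illustration only, not confirmed mcpbr options.
#   --benchmark    selects the benchmark suite to run
#   --sample-size  caps the number of tasks evaluated
#   --report       sets the output path for the results report
#   --verbose      enables detailed diagnostic logging
mcpbr run \
  --benchmark swe-bench-lite \
  --sample-size 25 \
  --report results/report.json \
  --verbose
```

In a setup like this, the skill's pre-configured defaults would cover the sample size, report destination, and verbosity, so a bare run with no flags would still produce a usable evaluation.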