How do I initialize a configuration file?

Ask Claude to run the 'mcpbr init' command, which generates a template YAML configuration file. A configuration file is required before any evaluation can run.

Which benchmarks are supported by this skill?

It supports SWE-bench (bug fixes), CyberGym (security vulnerabilities and exploits), and MCPToolBench++ (large-scale tool use evaluation).

Do I need Docker to run these benchmarks?

Yes. Docker is required: mcpbr uses it to isolate and manage the benchmarking environments so that results are reproducible.
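A quick pre-flight check along these lines can catch a missing prerequisite before an evaluation starts. This is a minimal sketch, not mcpbr's own validation logic: the function name `missing_prerequisites` is hypothetical, and the API key variable name `ANTHROPIC_API_KEY` is an assumption that should be replaced with whatever your configuration expects.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which,
                          api_key_var="ANTHROPIC_API_KEY"):
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper for illustration. `api_key_var` is an assumed
    variable name, not one confirmed by the mcpbr documentation.
    """
    env = os.environ if env is None else env
    problems = []
    # shutil.which returns None when the executable is not on PATH.
    if which("docker") is None:
        problems.append("docker executable not found on PATH")
    # Treat an unset or empty variable as missing.
    if not env.get(api_key_var):
        problems.append(f"environment variable {api_key_var} is not set")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
```

Finding `docker` on PATH only shows that the client is installed. It does not confirm that the Docker daemon is running, so a passing check here is necessary but not sufficient.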

Can I run evaluations on specific tasks only?

Yes, the skill allows you to target specific task IDs using the -t flag, enabling you to debug or re-run specific benchmark instances.
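As a rough illustration of task targeting, the sketch below assembles an argument list that passes task IDs through the -t flag. Only -t comes from the text above. The `run` subcommand, the `-c` config option, the default file name `mcpbr.yaml`, and the idea of repeating -t once per task are all assumptions, and `build_task_command` is a hypothetical helper. Check `mcpbr --help` for the real interface before relying on any of them.

```python
import shlex


def build_task_command(task_ids, config_path="mcpbr.yaml"):
    """Build an argv list that targets specific benchmark tasks.

    Hypothetical helper: "run", "-c", and the default config path are
    assumptions. Only the -t flag is named in the skill description.
    """
    cmd = ["mcpbr", "run", "-c", config_path]
    for task_id in task_ids:
        # Assumption: -t is given once per task ID.
        cmd += ["-t", task_id]
    return cmd


if __name__ == "__main__":
    # Example SWE-bench style instance ID, used only for illustration.
    argv = build_task_command(["astropy__astropy-12907"])
    # shlex.join quotes arguments so the line is safe to paste in a shell.
    print(shlex.join(argv))
```

Building the command as a list rather than a single string means it can be handed to `subprocess.run` without shell quoting problems.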

What is the primary purpose of the mcpbr-eval skill?

It automates the evaluation of Model Context Protocol (MCP) servers using the mcpbr CLI to test AI agents against standardized benchmarks like SWE-bench.

MCP Benchmark Runner

Name: MCP Benchmark Runner
Author: greynewell


Security & Testing

Evaluates and benchmarks Model Context Protocol (MCP) servers using standardized datasets like SWE-bench and CyberGym.

The MCP Benchmark Runner skill enables Claude to perform rigorous, automated evaluations of MCP servers by interfacing with the mcpbr CLI. It streamlines the benchmarking process for AI agents across diverse datasets, including SWE-bench for real-world bug fixes, CyberGym for security vulnerability exploits, and MCPToolBench++ for tool-use proficiency. By managing environment prerequisites, validating YAML configurations, and generating detailed Markdown reports, this skill provides a standardized framework for measuring the reliability and performance of AI tools and agentic workflows.

Key Features

01 Template generation and validation for mcpbr configuration files

02 Detailed result exporting in JSON and Markdown report formats

03 Automated prerequisite validation for Docker and API environment variables

04 Support for granular testing including sample sizing and specific task selection

05 20 GitHub stars

06 Standardized benchmarking across SWE-bench, CyberGym, and MCPToolBench++

Use Cases

01 Measuring the success rate of AI agents in fixing real-world software bugs

02 Comparing performance metrics across different AI models for specific toolsets

03 Evaluating the security performance of models against simulated cyber vulnerabilities


Install with 🐟 Skill.Fish

npx skillfish add greynewell/mcpbr mcpbr-eval
