Provides a benchmark for evaluating general-purpose agents on tasks that require interacting with real-world services through a large collection of task-specific tools.
The MCP Company introduces a benchmark for evaluating tool-calling agents in complex, real-world environments. Using the Model Context Protocol (MCP), it builds servers from the REST APIs of various services, exposing a collection of over 18,000 task-specific tools. Each task also comes with manually annotated ground-truth tools, so agent performance can be assessed both with the ideal tool set and with a retrieved one. The results show that even advanced reasoning models struggle to navigate and combine tens of thousands of tools, underscoring the need for better reasoning and retrieval mechanisms in enterprise-scale environments.
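The contrast between the two evaluation conditions can be made concrete with a minimal Python sketch. This is not the benchmark's actual code: the names `Tool`, `Task`, `retrieve_tools`, `run_agent`, and `evaluate` are hypothetical, and the toy word-overlap retriever merely stands in for whatever retrieval mechanism is used over the full catalog of tools.

```python
"""Minimal sketch (hypothetical, not the benchmark's code) of evaluating an
agent with manually annotated ground-truth tools vs. tools retrieved from a
large catalog."""

from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    description: str


@dataclass
class Task:
    prompt: str
    ground_truth_tools: list[str]  # manually annotated tool names


def retrieve_tools(task: Task, catalog: list[Tool], k: int = 10) -> list[Tool]:
    """Toy lexical retriever: rank tools by word overlap with the task prompt.
    A realistic system would need stronger retrieval over ~18,000 tools."""
    prompt_words = set(task.prompt.lower().split())
    scored = sorted(
        catalog,
        key=lambda t: len(prompt_words & set(t.description.lower().split())),
        reverse=True,
    )
    return scored[:k]


def run_agent(task: Task, tools: list[Tool]) -> bool:
    """Placeholder for the agent loop (LLM reasoning + MCP tool calls).
    Here we only check whether the annotated tools were even available,
    which upper-bounds what any agent could achieve with this tool set."""
    available = {t.name for t in tools}
    return set(task.ground_truth_tools) <= available


def evaluate(tasks: list[Task], catalog: list[Tool]) -> dict[str, float]:
    """Compare the 'ideal' condition (ground-truth tools) against retrieval."""
    by_name = {t.name: t for t in catalog}
    ideal = retrieved = 0
    for task in tasks:
        gt_tools = [by_name[n] for n in task.ground_truth_tools if n in by_name]
        ideal += run_agent(task, gt_tools)
        retrieved += run_agent(task, retrieve_tools(task, catalog))
    n = max(len(tasks), 1)
    return {"ideal": ideal / n, "retrieved": retrieved / n}


if __name__ == "__main__":
    catalog = [
        Tool("issues_create", "create a new issue in a project tracker"),
        Tool("invoices_list", "list invoices for a customer account"),
        Tool("vm_restart", "restart a cloud virtual machine instance"),
    ]
    tasks = [Task("open an issue about the failing build", ["issues_create"])]
    print(evaluate(tasks, catalog))  # e.g. {'ideal': 1.0, 'retrieved': 1.0}
```

With only three catalog entries both conditions trivially succeed; the point of the benchmark is that at the scale of tens of thousands of tools, the retrieved condition degrades while the ground-truth condition isolates the agent's reasoning ability from retrieval quality.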