Is it safe to run the code generated by the models?

The harness can execute generated code locally, but running it inside the provided Docker containers is strongly recommended: the isolated environment protects your host system from untrusted model output.

What is the pass@k metric?

Pass@k is the probability that at least one of k code samples generated by a model for a given problem passes all of that problem's unit tests. Because it is estimated from multiple samples, it is a more robust quality metric than scoring a single greedy-decoded completion.

Does this work with instruction-tuned (chat) models?

Yes, the harness includes specialized tasks like instruct-humaneval and parameters for instruction_tokens to properly format prompts for chat and instruction-following models.

Can I evaluate models in languages other than Python?

Yes, via the MultiPL-E benchmark integration, you can evaluate model performance across 18 different languages including Java, C++, Rust, Go, and JavaScript.

How do I compare my results to the HuggingFace leaderboard?

By using the standard HumanEval and MBPP tasks within this harness, you generate results that are directly comparable to the official BigCode and HuggingFace Open Code LLM Leaderboards.

BigCode Evaluation Harness

Name: BigCode Evaluation Harness
Author: Orchestra-Research

GitHub stars: 3,983
Category: Data Science and ML

Evaluates AI code generation models using industry-standard benchmarks and pass@k metrics.

The BigCode Evaluation Harness is a comprehensive benchmarking suite designed to measure the functional correctness of code generation models. It provides a standardized framework for running 15+ benchmarks—including HumanEval, MBPP, and MultiPL-E—across 18 programming languages. By automating code execution in secure environments and calculating pass@k metrics, it enables researchers and engineers to rigorously compare model performance, test instruction-tuning effectiveness, and validate multi-language capabilities against the same standards used by the HuggingFace leaderboards.
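To make "functional correctness" concrete, the sketch below shows the general idea of execution-based checking: a generated solution is concatenated with its unit tests and run in a separate process, and it counts as a pass only if the process exits cleanly within a time limit. This is an illustration under stated assumptions, not the harness's implementation; `passes_tests` is a hypothetical helper, and a subprocess with a timeout is not a sandbox, so it does not replace container isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate: str, test_code: str, timeout: float = 10.0) -> bool:
    """Return True if candidate + tests exit with status 0 within the timeout.

    Hypothetical helper for illustration only; it offers no real isolation.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        # The tests are plain assert statements appended after the solution.
        script.write_text(candidate + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Non-terminating solutions count as failures.
            return False
        # A failed assert or any uncaught exception gives a non-zero exit code.
        return proc.returncode == 0
```

Running this once per generated sample yields the per-problem pass counts from which pass@k is computed.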

Key Features

01 Support for 15+ benchmarks including HumanEval, MBPP, and APPS

02 Instruction-tuned model testing with custom prompt templates and tokens

03 Standardized pass@k metric calculation for objective performance measurement

04 Multi-language evaluation across 18 different programming languages via MultiPL-E

05 3,983 GitHub stars

06 Secure code execution using Docker containers to prevent host contamination

Use Cases

01 Measuring the impact of fine-tuning or quantization on a model's functional correctness

02 Benchmarking custom-trained code models against industry leaders like StarCoder or CodeLlama

03 Validating the multi-language coding proficiency of a new LLM across 18 languages


Install with 🐟 Skill.Fish

npx skillfish add orchestra-research/ai-research-skills bigcode-evaluation-harness
