01 Support for 15+ benchmarks, including HumanEval, MBPP, and APPS
02 Instruction-tuned model testing with custom prompt templates and tokens
03 Standardized pass@k metric calculation for objective performance measurement
04 Multi-language evaluation across 18 programming languages via MultiPL-E
05 3,983 GitHub stars
06 Secure code execution using Docker containers to prevent host contamination