Does it support local model checkpoints?

Absolutely. You can point the harness to custom local paths for both model weights and tokenizers to evaluate models before they are uploaded to a registry.
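As a rough illustration, the sketch below assembles a command line for a checkpoint on local disk. It assumes the underlying EleutherAI harness is installed and exposes its `lm_eval` command with the `hf` backend; the helper name `build_local_eval_command` and both paths are hypothetical placeholders, not part of the skill.

```python
import shlex


def build_local_eval_command(model_dir, tokenizer_dir, tasks, batch_size=8):
    """Build an lm_eval argument list for a checkpoint stored on local disk."""
    # The hf backend reads comma-separated key=value pairs from --model_args;
    # `pretrained` and `tokenizer` may both be local directories.
    model_args = f"pretrained={model_dir},tokenizer={tokenizer_dir}"
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]


# Hypothetical paths: substitute the directories that hold your own files.
cmd = build_local_eval_command(
    "./checkpoints/step-1000",
    "./checkpoints/tokenizer",
    ["mmlu", "gsm8k"],
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.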

Is code execution supported for benchmarks like HumanEval?

Yes, it includes instructions for enabling the '--allow_code_execution' flag, which is required for functional testing of generated code in benchmarks like HumanEval and MBPP.

What benchmarks does this skill support?

It supports over 60 standard academic benchmarks, including MMLU for general knowledge, GSM8K for math, HumanEval for coding, and TruthfulQA for factuality.

Can I use this for faster inference?

Yes, the skill includes a specific workflow for the vLLM backend, which can provide 5-10x faster evaluation compared to standard HuggingFace transformers.
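A minimal sketch of what selecting the vLLM backend might look like, assuming the harness's `lm_eval` command is installed together with vLLM. The helper names are hypothetical, and the engine options shown (`tensor_parallel_size`, `gpu_memory_utilization`) are illustrative defaults rather than settings taken from the skill.

```python
def vllm_model_args(pretrained, tensor_parallel_size=1, gpu_memory_utilization=0.8):
    """Format vLLM engine options as the comma-separated string lm_eval expects."""
    options = {
        "pretrained": pretrained,
        "tensor_parallel_size": tensor_parallel_size,
        "gpu_memory_utilization": gpu_memory_utilization,
    }
    return ",".join(f"{key}={value}" for key, value in options.items())


def build_vllm_command(pretrained, tasks, **engine_options):
    """Build an lm_eval argument list that runs the given tasks through vLLM."""
    return [
        "lm_eval",
        "--model", "vllm",
        "--model_args", vllm_model_args(pretrained, **engine_options),
        "--tasks", ",".join(tasks),
        # "auto" lets the backend pick the batch size instead of a fixed value.
        "--batch_size", "auto",
    ]


# Hypothetical checkpoint path, sharded across two GPUs.
vllm_cmd = build_vllm_command("./checkpoints/step-1000", ["gsm8k"], tensor_parallel_size=2)
print(" ".join(vllm_cmd))
```

Any speedup depends on the model, hardware, and batch settings, so it is worth timing both backends on a small task before committing to a full run.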

LLM Evaluation Harness

Author: zechenzhangAGI

GitHub stars: 384

Category: Data Science & ML

Evaluates Large Language Models across 60+ academic benchmarks to measure reasoning, coding, and mathematical capabilities using industry-standard metrics.

The LM Evaluation Harness skill integrates the industry-standard EleutherAI benchmarking framework into your workflow, allowing for rigorous testing of LLMs against datasets like MMLU, GSM8K, and HumanEval. It provides standardized prompts and metrics to ensure reproducible results, whether you are benchmarking a new model release, tracking progress during fine-tuning, or comparing performance across different architectures. By supporting HuggingFace, vLLM, and various APIs, it enables researchers and engineers to generate academic-grade reports and leaderboard-ready data directly through Claude Code.

Key Features

01. Support for multiple inference backends, including HuggingFace and vLLM

02. 384 GitHub stars

03. Efficient benchmarking with quantization support (4-bit/8-bit) and multi-GPU strategies

04. Standardized evaluation across 60+ academic tasks (MMLU, GSM8K, HumanEval, etc.)

05. Automated workflows for tracking training progress and plotting learning curves

06. Built-in comparison tools to generate markdown performance tables for multiple models
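
To make the comparison-table feature concrete, here is a small sketch of how per-model scores could be rendered as a markdown table. It is not the skill's own tool: the function `markdown_table`, the model names, and the scores are all hypothetical placeholders, and a real run would load the scores from the harness's result files instead.

```python
def markdown_table(results, tasks):
    """Render {model: {task: score}} as a markdown table with one row per model."""
    header = "| Model | " + " | ".join(tasks) + " |"
    divider = "|" + " --- |" * (len(tasks) + 1)
    rows = []
    for model, scores in results.items():
        # Show "n/a" when a model was not evaluated on a task.
        cells = [f"{scores[task]:.1f}" if task in scores else "n/a" for task in tasks]
        rows.append(f"| {model} | " + " | ".join(cells) + " |")
    return "\n".join([header, divider] + rows)


# Placeholder numbers for illustration only, not measured results.
placeholder_scores = {
    "model-a": {"mmlu": 50.0, "gsm8k": 25.0},
    "model-b": {"mmlu": 60.0},
}
table = markdown_table(placeholder_scores, ["mmlu", "gsm8k"])
print(table)
```

Keeping the task list explicit fixes the column order, which makes tables from different runs easier to compare side by side.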

Use Cases

01. Comparing the inference efficiency and accuracy of different model families

02. Benchmarking a fine-tuned model against baseline performance for academic reporting

03. Tracking model quality improvements at specific checkpoints during the training loop

Install with 🐟 Skill.Fish

npx skillfish add zechenzhangagi/ai-research-skills lm-evaluation-harness