About
The LM Evaluation Harness skill integrates EleutherAI's industry-standard benchmarking framework into your workflow, enabling rigorous testing of LLMs against benchmarks such as MMLU, GSM8K, and HumanEval. It applies standardized prompts and metrics so results stay reproducible, whether you are benchmarking a new model release, tracking progress during fine-tuning, or comparing performance across architectures. With support for HuggingFace, vLLM, and API-based backends, it lets researchers and engineers produce academic-grade reports and leaderboard-ready data directly through Claude Code.
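As a rough sketch of what such a run can look like, the snippet below calls the harness's Python entry point, `lm_eval.simple_evaluate`, to score a HuggingFace model on GSM8K. The model name, task list, few-shot count, and batch size are illustrative placeholders rather than defaults of this skill, and the example assumes the upstream `lm-eval` package is installed.

```python
# Minimal sketch: evaluate a Hugging Face model with the EleutherAI harness
# (pip install lm-eval). Model id, tasks, and settings are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # HuggingFace Transformers backend
    model_args="pretrained=EleutherAI/pythia-1.4b",  # any HF model id works here
    tasks=["gsm8k"],                                 # e.g. ["mmlu", "gsm8k"]
    num_fewshot=5,                                   # common 5-shot setting
    batch_size=8,
)

# results["results"] maps each task name to its aggregated metrics.
for task, metrics in results["results"].items():
    print(task, metrics)
```

The same run can be expressed through the harness's `lm_eval` command-line interface; the Python form is shown here because it makes it easy to post-process the returned metrics into reports.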