About
The LM Evaluation Harness skill integrates EleutherAI's industry-standard benchmarking framework into your workflow, enabling rigorous testing of LLMs against benchmarks such as MMLU, GSM8K, and HumanEval. It applies standardized prompts and metrics so results stay reproducible, whether you are benchmarking a new model release, tracking progress during fine-tuning, or comparing performance across architectures. With support for HuggingFace, vLLM, and API-based backends, it lets researchers and engineers produce academic-grade reports and leaderboard-ready data directly through Claude Code.
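As a rough sketch of what such a run can look like, the snippet below calls the harness's Python entry point, `lm_eval.simple_evaluate`, to score a HuggingFace model on GSM8K. The model name, task list, few-shot count, and batch size are illustrative placeholders rather than defaults of this skill, and the example assumes the upstream `lm-eval` package is installed.

```python
# Minimal sketch: evaluate a Hugging Face model with the EleutherAI harness
# (pip install lm-eval). Model id, tasks, and settings are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # HuggingFace Transformers backend
    model_args="pretrained=EleutherAI/pythia-1.4b",  # any HF model id works here
    tasks=["gsm8k"],                                 # e.g. ["mmlu", "gsm8k"]
    num_fewshot=5,                                   # common 5-shot setting
    batch_size=8,
)

# results["results"] maps each task name to its aggregated metrics.
for task, metrics in results["results"].items():
    print(task, metrics)
```

The same run can be expressed through the harness's `lm_eval` command-line interface; the Python form is shown here because it makes it easy to post-process the returned metrics into reports.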