Can I use this skill with self-hosted models?

Yes, it supports any OpenAI-compatible API endpoint, making it compatible with models hosted via vLLM, TensorRT-LLM, or custom backend solutions.

What are the infrastructure requirements?

You need Python 3.10+, Docker for local execution, or access to a Slurm/Lepton environment for distributed benchmarking.

What benchmarks are supported by NeMo Evaluator?

It supports over 100 benchmarks including MMLU, GSM8K, HumanEval, IFEval, GPQA, and more from 18+ different specialized evaluation harnesses.

How does this compare to standard evaluation harnesses?

NeMo Evaluator provides a unified configuration layer and containerized environment that simplifies running multiple different harnesses simultaneously while ensuring results are reproducible.

NeMo Evaluator SDK

Name: NeMo Evaluator SDK
Author: eyadsibai

byeyadsibai

0•

データサイエンスとML

Evaluates Large Language Models across 100+ benchmarks using a reproducible, containerized framework.

The NeMo Evaluator SDK skill provides a comprehensive framework for benchmarking LLMs at scale, supporting over 100 standardized tests from 18+ evaluation harnesses like lm-evaluation-harness and HumanEval. It streamlines the evaluation process by offering containerized execution to ensure reproducibility across local Docker environments, Slurm HPC clusters, and cloud backends. This tool is ideal for developers and researchers who need to validate model performance, compare different architectures, and automate regression testing for mathematics, coding, and general instruction-following capabilities.

主な機能

01Unified interface for 18+ evaluation harnesses (simple-evals, bigcode, etc.)

02Multi-target support for NVIDIA NIM, vLLM, and OpenAI-compatible APIs

03Access to 100+ industry-standard benchmarks including MMLU, GSM8K, and HumanEval

04Containerized, reproducible execution across local, Slurm, and Lepton backends

050 GitHub stars

06Direct result exporting to MLflow, Weights & Biases, and local YAML formats

ユースケース

01Scaling massive evaluation pipelines across enterprise Slurm GPU clusters

02Comparing performance delta between different quantization levels and backends

03Benchmarking custom fine-tuned models against industry leaderboards

What are Skills?·How to Install

Install with 🐟 Skill.Fish

npx skillfish add eyadsibai/ltk nemo-evaluator

For use in Claude.ai and ChatGPT