Edge and mobile inference deployment patterns
Quantization strategies (AWQ, GPTQ, FP8, INT8)
Speculative decoding for 1.5-2.5x throughput gains
GPU memory optimization and PagedAttention tuning
vLLM 0.14.x production deployment and configuration