Quantization implementation for AWQ, GPTQ, INT8, and FP8 formats
Speculative decoding setup using draft models or n-gram lookups
Memory optimization via PagedAttention and continuous batching
Automated performance benchmarking for throughput and latency analysis
Production-grade vLLM 0.14.x deployment and configuration patterns
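The INT8 quantization listed above reduces to one idea: map floating-point weights onto a small integer range with a shared scale, then multiply back at compute time. A minimal pure-Python sketch of symmetric per-tensor INT8 quantization (the function names are illustrative; real AWQ/GPTQ/FP8 kernels work on tensors with per-channel or per-group scales):

```python
# Sketch of symmetric per-tensor INT8 quantization: one scale maps floats
# into [-127, 127]; dequantization multiplies back. Illustrative only --
# production kernels use finer-grained (per-channel/per-group) scales.

def quantize_int8(values):
    """Quantize floats to int8 [-127, 127] with a single symmetric scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats from the quantized values."""
    return [v * scale for v in q]

weights = [0.5, -1.25, 2.0, -0.01]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Rounding error is bounded by half the scale step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
assert max_err <= scale / 2 + 1e-9
```

The round-trip error bound (half a quantization step) is why larger weight outliers widen the scale and hurt accuracy, and why methods like AWQ rescale salient channels before quantizing.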
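The n-gram variant of speculative decoding proposes draft tokens by matching the last few generated tokens against earlier context and reusing whatever followed the match; the target model then verifies the draft in one pass. A sketch of just the proposal step, under assumed names (`ngram_draft` is illustrative, not a vLLM API):

```python
# Sketch of n-gram ("prompt lookup") draft proposal for speculative decoding:
# find the most recent earlier occurrence of the last n tokens and propose
# the tokens that followed it. Verification by the target model is omitted.

def ngram_draft(tokens, n=2, max_draft=4):
    """Return up to max_draft speculative tokens, or [] if no n-gram match."""
    if len(tokens) < n:
        return []
    key = tokens[-n:]
    # Scan backwards so the most recent earlier match wins.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            return tokens[start + n:start + n + max_draft]
    return []

ctx = ["the", "cat", "sat", "on", "the", "cat"]
draft = ngram_draft(ctx, n=2)   # proposes the continuation seen after the earlier "the cat"
```

This is cheap because it needs no draft model at all, which is why it pairs well with repetitive workloads like code editing or summarization, where the output often echoes the prompt.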
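The core of PagedAttention is that KV-cache memory is carved into fixed-size blocks handed to sequences on demand, instead of reserving a max-length slab per sequence up front. A toy block allocator showing that idea (class and sizes are illustrative, not vLLM internals):

```python
# Sketch of PagedAttention-style KV-cache management: a free-list of
# fixed-size blocks, a per-sequence block table, and allocation only when
# a sequence crosses a block boundary. Sizes here are illustrative.

class BlockAllocator:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free-list of block ids
        self.tables = {}                      # seq_id -> list of block ids

    def append_token(self, seq_id, pos):
        """Allocate a new block only when token pos starts a fresh block."""
        table = self.tables.setdefault(seq_id, [])
        if pos % self.block_size == 0:        # first slot of a new block
            if not self.free:
                raise MemoryError("KV cache exhausted; a sequence must be preempted")
            table.append(self.free.pop())
        return table[-1]                      # block id holding this token

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free.extend(self.tables.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=4, block_size=2)
for pos in range(3):                          # 3 tokens span 2 blocks
    alloc.append_token("seq0", pos)
assert len(alloc.tables["seq0"]) == 2
alloc.release("seq0")
assert len(alloc.free) == 4
```

Continuous batching builds on the same pool: because finished sequences release blocks immediately, new requests can join the running batch at any decoding step rather than waiting for the whole batch to drain.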
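A benchmarking harness for the throughput and latency analysis above boils down to timing each request and reporting requests-per-second plus latency percentiles. A minimal sketch with a stand-in generate call (`fake_generate` is a placeholder, not a real client):

```python
# Sketch of a throughput/latency benchmark loop: time each request, then
# report requests-per-second and p50/p99 latency. fake_generate stands in
# for a real model or HTTP call to an inference server.

import time

def fake_generate(prompt):
    time.sleep(0.001)                 # placeholder for model latency
    return prompt[::-1]

def benchmark(requests):
    latencies = []
    start = time.perf_counter()
    for req in requests:
        t0 = time.perf_counter()
        fake_generate(req)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    pct = lambda p: latencies[min(len(latencies) - 1, int(p * len(latencies)))]
    return {
        "throughput_rps": len(requests) / elapsed,
        "p50_s": pct(0.50),
        "p99_s": pct(0.99),
    }

stats = benchmark(["hello"] * 50)
assert stats["p50_s"] <= stats["p99_s"]
```

Serial timing like this measures latency under no contention; for a serving benchmark you would also sweep concurrency, since continuous batching trades per-request latency for aggregate throughput as load rises.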
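A deployment of the kind described above typically launches the OpenAI-compatible server with explicit memory, parallelism, and quantization settings. A hedged sketch, assuming a hypothetical model name; verify each flag against `vllm serve --help` for your installed version before relying on it:

```shell
# Illustrative launch command, not a canonical configuration.
# Model name and flag values are placeholders; check `vllm serve --help`
# for the flags supported by your vLLM version.
vllm serve my-org/my-model \
  --quantization awq \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --port 8000
```

Pinning `--gpu-memory-utilization` and `--max-model-len` explicitly is what makes the KV-cache block pool predictable across restarts, which matters once you capacity-plan around it.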