How does this skill help with large models like Llama-3-70B?

It provides configurations for tensor parallelism and quantization (AWQ/GPTQ) to fit and run large models across multiple GPUs or on hardware with limited VRAM.

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs, utilizing PagedAttention for optimized KV cache management.

Can I use this for batch processing?

Yes, the skill includes specialized workflows for offline batch inference to process large datasets without the overhead of a live API server.

What metrics can I monitor?

The skill guides you in enabling Prometheus metrics to track time-to-first-token (TTFT), request throughput, and GPU cache utilization.

Is it compatible with the OpenAI API?

Yes, vLLM supports a built-in OpenAI-compatible server, allowing you to use existing OpenAI client libraries for your own hosted models.

vLLM Inference Serving

Name: vLLM Inference Serving
Author: Orchestra-Research

byOrchestra-Research

•

3,983

•

データサイエンスとML

Serves large language models with high throughput and low latency using PagedAttention and continuous batching.

This skill empowers AI researchers and engineers to deploy production-grade LLM APIs using vLLM, the industry-leading high-performance inference engine. It provides standardized workflows for setting up OpenAI-compatible servers, optimizing GPU memory through quantization (AWQ/GPTQ/FP8), and implementing tensor parallelism for massive models. Whether you are running offline batch inference on large datasets or building real-time chatbots, this skill offers the configurations and best practices needed to maximize hardware utilization and minimize time-to-first-token.

主な機能

01Built-in monitoring with Prometheus metrics and performance tracking

02Multi-GPU acceleration via tensor parallelism for large-scale models

03High-throughput inference with PagedAttention and continuous batching

04Seamless deployment of OpenAI-compatible API endpoints

053,983 GitHub stars

06Advanced quantization support including AWQ, GPTQ, and FP8

ユースケース

01Deploying a production-grade LLM API for multi-user chat applications

02Running high-speed offline batch processing on massive text datasets

03Optimizing large model inference on limited GPU hardware using quantization

What are Skills?·How to Install

Install with 🐟 Skill.Fish

npx skillfish add orchestra-research/ai-research-skills vllm

For use in Claude.ai and ChatGPT

Download Skill