How does this skill help reduce AI deployment costs?

It provides patterns for quantization (reducing memory footprint) and PagedAttention (maximizing throughput), allowing you to serve more users on less expensive hardware.
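To make the memory claim concrete, here is a back-of-the-envelope sketch of how weight precision affects the memory needed just to hold a model's parameters. The function name `weight_memory_gib` and the 7-billion-parameter figure are illustrative assumptions, not part of the skill. The estimate ignores the KV cache, activations, and the per-group scales that quantization formats store alongside the weights.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GiB.

    Hypothetical helper for illustration: real deployments also need
    memory for the KV cache, activations, and quantization metadata.
    """
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


if __name__ == "__main__":
    # Assumed example: a 7-billion-parameter model at three precisions.
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: {weight_memory_gib(7e9, bits):.2f} GiB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts weight memory to a quarter, which is often the difference between needing several GPUs and fitting on one.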

What frameworks are supported by the LLM Serving Patterns skill?

This skill provides guidance on a wide range of frameworks including vLLM, Text Generation Inference (TGI), TensorRT-LLM, Triton Inference Server, Ollama, and llama.cpp.

Can this skill help with latency issues?

Absolutely. It covers low-latency optimization techniques such as speculative decoding, continuous batching, and hardware-specific optimizations for NVIDIA GPUs.
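As a rough illustration of why continuous batching helps, the toy simulation below counts decode steps for a fixed set of requests under two scheduling policies. It is a simplified sketch, not how any particular framework schedules work: it assumes each request needs a known number of decode steps, each step produces one token per running request, and prefill cost is ignored. The function names are hypothetical.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Free slots are refilled from the queue after every decode step."""
    waiting = deque(lengths)
    running: List[int] = []  # remaining tokens per in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; finished ones leave.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 2, 8]  # assumed output lengths in tokens
    print("static:    ", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

In this toy setup the short requests no longer hold a slot idle while a long request in the same batch finishes, so the same work completes in fewer steps.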

Does it cover streaming for real-time chat?

Yes, it includes detailed patterns for implementing both Server-Sent Events (SSE) and WebSocket streaming for interactive LLM applications.
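For readers unfamiliar with SSE, the sketch below shows the wire format a streaming endpoint would emit for each generated token. It uses only the standard library and is not tied to any serving framework. The helper names, the `{"token": ...}` JSON payload shape, and the final `[DONE]` sentinel are assumptions chosen for illustration.

```python
import json
from typing import Iterable, Iterator, Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Encode one Server-Sent Event.

    Each payload line gets its own "data:" field, and a blank line
    terminates the event.
    """
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per token, then a hypothetical end marker."""
    for tok in tokens:
        yield format_sse(json.dumps({"token": tok}))
    yield format_sse("[DONE]", event="done")


if __name__ == "__main__":
    for frame in stream_tokens(["Hello", ",", " world"]):
        print(frame, end="")
```

A web framework would typically send these frames in a response with the `text/event-stream` content type. WebSockets are the usual alternative when the client also needs to send messages mid-stream.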

LLM Serving Patterns

Name: LLM Serving Patterns
Author: melodic-software

Data Science & ML

Optimizes LLM inference infrastructure and deployment strategies using industry-standard frameworks and architectural patterns.

The LLM Serving Patterns skill provides comprehensive guidance on designing, deploying, and scaling Large Language Model (LLM) serving infrastructure. It helps developers navigate critical technical decisions including framework selection among vLLM, TGI, and TensorRT-LLM, while offering deep dives into quantization techniques like AWQ and GPTQ. By implementing advanced patterns such as continuous batching, PagedAttention for KV cache management, and speculative decoding, this skill enables the creation of high-throughput, low-latency AI services capable of handling production-grade workloads efficiently.
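The idea behind PagedAttention-style KV cache management can be sketched with a toy block allocator: the cache is divided into fixed-size blocks, and each sequence keeps a table mapping its logical positions to physical blocks that are allocated only as tokens arrive. This is a simplified illustration under assumed names (`PagedKVAllocator`, `append_token`, `free`), not vLLM's actual implementation, and it tracks block IDs only, with no tensors.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Toy block-table allocator illustrating paged KV cache bookkeeping."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # seq_id -> physical blocks
        self.seq_lens: Dict[str, int] = {}            # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token of a sequence."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when all current blocks are full.
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


if __name__ == "__main__":
    alloc = PagedKVAllocator(num_blocks=4, block_size=16)
    for _ in range(17):  # 17 tokens span two 16-token blocks
        alloc.append_token("request-a")
    print(alloc.block_tables["request-a"], len(alloc.free_blocks))
```

Because blocks are claimed on demand rather than reserved for a maximum sequence length up front, at most one partially filled block is wasted per sequence, which leaves room for more concurrent requests.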

Key Features

01 Framework benchmarking and selection (vLLM, TGI, TensorRT-LLM)

02 Advanced quantization strategies including INT4, INT8, AWQ, and GPTQ

03 Continuous batching and throughput optimization patterns

04 Streaming response implementation using SSE and WebSockets

05 Memory management techniques like PagedAttention and KV caching

GitHub stars: 12
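To give a feel for what INT8 quantization does to a weight vector, here is a minimal sketch of symmetric absmax quantization in plain Python. It is the simplest textbook scheme and is shown only for intuition. AWQ and GPTQ are more sophisticated methods and are not what this code implements. The function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric absmax quantization to the range [-127, 127]."""
    # The scale maps the largest magnitude onto 127; fall back to 1.0
    # for an all-zero input to avoid dividing by zero.
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integer codes."""
    return [x * scale for x in q]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.003]  # assumed example values
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    for w, r in zip(weights, restored):
        print(f"{w:+.4f} -> {r:+.4f} (error {abs(w - r):.5f})")
```

Each value is stored in one byte instead of two or four, at the cost of a rounding error of at most half the scale per element.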

Use Cases

01 Reducing GPU memory requirements for model deployment via quantization

02 Optimizing real-time chat applications for minimal time-to-first-token (TTFT)

03 Architecting a high-concurrency inference backend for enterprise AI agents

Install with 🐟 Skill.Fish

npx skillfish add melodic-software/claude-code-plugins llm-serving-patterns
