Introduction
The LLM Serving Patterns skill provides comprehensive guidance on designing, deploying, and scaling Large Language Model (LLM) serving infrastructure. It helps developers navigate critical technical decisions, including framework selection among vLLM, TGI (Text Generation Inference), and TensorRT-LLM, and offers deep dives into quantization techniques such as AWQ and GPTQ. By applying advanced patterns such as continuous batching, PagedAttention for KV-cache management, and speculative decoding, the skill enables high-throughput, low-latency AI services that handle production-grade workloads efficiently.
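To make the continuous-batching pattern concrete, here is a minimal, framework-free sketch of the scheduling idea: finished requests leave the batch immediately and waiting requests join mid-stream, rather than the whole batch draining before new work is admitted. The `Request` class, `continuous_batching` function, and `max_batch` parameter are illustrative names for this sketch, not vLLM's or TGI's actual API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """A hypothetical in-flight generation request (illustrative only)."""
    rid: int
    remaining_tokens: int  # tokens still to generate


def continuous_batching(requests, max_batch=2):
    """Toy continuous-batching loop.

    Each iteration is one decode step in which every running request
    emits one token. Finished requests retire immediately, freeing
    their slot for a waiting request on the very next step.
    """
    waiting = deque(requests)
    running = []
    steps = 0
    completed = []  # request ids, in order of completion
    while waiting or running:
        # Admit new requests whenever a batch slot is free.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step for the whole running batch.
        for r in running:
            r.remaining_tokens -= 1
        steps += 1
        # Retire finished requests right away (the key difference
        # from static batching, which waits for the slowest request).
        completed.extend(r.rid for r in running if r.remaining_tokens == 0)
        running = [r for r in running if r.remaining_tokens > 0]
    return steps, completed


reqs = [Request(0, 3), Request(1, 1), Request(2, 2)]
steps, order = continuous_batching(reqs, max_batch=2)
```

With these inputs the loop finishes in 3 decode steps, whereas static batching with the same batch size of 2 would need 5 (3 steps for the first batch, then 2 for the straggler), which is the throughput win continuous batching delivers.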