Introduction
This skill helps developers deploy production-ready LLM APIs on vLLM's high-throughput inference engine. It provides focused guidance on managing KV-cache memory with PagedAttention, tuning continuous batching for high-concurrency workloads, and applying quantization techniques such as AWQ and GPTQ to fit large models on limited hardware. Whether you are building an OpenAI-compatible service or running large offline batch jobs, this skill streamlines the configuration of tensor parallelism, Prometheus-based monitoring, and Docker deployments.
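As a quick illustration, here is a minimal sketch of the offline batch path, assuming vLLM is installed on a node with two GPUs; the model id is a placeholder for any AWQ-quantized checkpoint, and the sampling values are arbitrary:

```python
from vllm import LLM, SamplingParams

# Placeholder model id: substitute any AWQ-quantized checkpoint.
llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",
    quantization="awq",            # load AWQ weights to fit on smaller GPUs
    tensor_parallel_size=2,        # shard the model across 2 GPUs
    gpu_memory_utilization=0.90,   # fraction of VRAM PagedAttention may claim
)

prompts = [
    "Summarize the benefits of continuous batching in one sentence.",
    "Explain tensor parallelism to a new engineer.",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

# generate() schedules all prompts through vLLM's continuous-batching engine.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text.strip())
```

The same engine backs the OpenAI-compatible server (`vllm serve <model>` in recent versions, or `python -m vllm.entrypoints.openai.api_server`), which also exposes Prometheus metrics at its `/metrics` endpoint for the monitoring setup described above.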