Introduction
This skill equips Claude with the expertise to work with NVIDIA's TensorRT-LLM library, enabling high-performance inference serving for production-grade AI applications. It provides guidance on optimizing LLM performance through advanced techniques such as FP8/INT4 quantization, in-flight batching, and paged KV caching. Whether you are deploying on a single H100 or scaling across multi-node GPU clusters, this skill helps you configure tensor and pipeline parallelism to achieve up to 100x speedups over standard PyTorch implementations, making it essential for high-scale AI research and engineering.
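As a concrete starting point, the sketch below uses TensorRT-LLM's high-level Python `LLM` API to combine several of the techniques mentioned above: FP8 quantization, a paged KV cache, and tensor parallelism. This is a minimal sketch, not a definitive recipe; the checkpoint name, parallelism degree, and memory fraction are illustrative assumptions, and exact argument names can shift between TensorRT-LLM releases.

```python
# A minimal sketch, assuming TensorRT-LLM's high-level Python LLM API.
# The checkpoint, parallelism degree, and memory fraction are placeholders.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig, QuantAlgo, QuantConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative HF checkpoint
    tensor_parallel_size=2,                    # shard weights across 2 GPUs
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),           # FP8 quantization
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.9),  # paged KV cache pool
)

# In-flight batching is handled by the runtime: concurrent requests are
# batched together automatically as they arrive, with no extra flag here.
outputs = llm.generate(
    ["Explain paged KV caching in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```

In recent releases, in-flight (continuous) batching and paged KV cache management are enabled by default in the executor, so most tuning effort goes into the quantization scheme, parallelism layout, and KV cache memory budget shown above.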