About
The LLM Inference skill provides guidance for serving large language models (LLMs): it compares and configures popular inference engines such as vLLM, llama.cpp, and TGI, and helps users choose a deployment strategy that matches the available hardware, from high-performance NVIDIA GPUs to resource-constrained CPUs and Apple Silicon. Covering GGUF quantization, memory requirements, and techniques such as PagedAttention and continuous batching, the skill helps developers maximize throughput and minimize latency in both production and local development environments.
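As a back-of-the-envelope illustration of the memory-requirements guidance, the sketch below estimates a quantized model's footprint as weights plus KV cache. Every constant here (the ~4.8 bits/weight for a Q4_K_M-style GGUF quant, the 7B architecture dimensions) is an assumption for the example, not a figure from the skill; real footprints also include runtime overhead such as activation buffers and, under vLLM, the pre-allocated PagedAttention KV block pool.

```python
# Rough, illustrative memory estimate for a quantized LLM.
# All constants below are assumptions for this sketch, not vendor figures.

def weights_gib(n_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB: params * bits / 8 bytes per byte."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 2**30

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache memory in GiB for one sequence (K and V, fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 2**30

# Example: a hypothetical 7B model at ~4.8 bits/weight (Q4_K_M-style),
# with 32 layers and 8 KV heads of dim 128, serving an 8K context.
total = weights_gib(7, 4.8) + kv_cache_gib(32, 8, 128, 8192)
print(f"Estimated footprint: {total:.1f} GiB")  # ~3.9 GiB weights + ~1.0 GiB KV
```

And for the GPU path, a minimal offline-inference sketch using vLLM's Python API, which applies PagedAttention and continuous batching under the hood; the model ID and sampling settings are placeholders, not recommendations from the skill:

```python
from vllm import LLM, SamplingParams

# Load a model (any Hugging Face model ID works; this one is an example).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM batches these prompts continuously inside its engine.
outputs = llm.generate(["What is PagedAttention?", "Explain GGUF in one line."], params)
for out in outputs:
    print(out.outputs[0].text)
```

On CPU-only or Apple Silicon machines, the analogous entry point would be a GGUF model loaded through llama.cpp rather than vLLM.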