Why should I merge LoRA weights before production serving?

Production engines like vLLM and SGLang require merged weights to maintain high throughput and performance, as loading separate adapters adds overhead.

What serving engines are supported?

The skill supports native Python inference, vLLM, SGLang, and GGUF-based tools like Ollama or llama.cpp.

How does Unsloth speed up native inference?

Unsloth uses specialized Triton kernels and on-the-fly weight fusion to double native inference speed through the for_inference() function.

Does this skill support low-VRAM environments?

Yes, it provides specific workflows for 4-bit merging and exporting models to GGUF format for efficient serving on limited hardware.

Can I replace my OpenAI API with an Unsloth model?

Yes, the skill guides you through using the unsloth-cli to create an OpenAI-compatible endpoint that works as a drop-in replacement for standard API clients.

Unsloth Model Inference

Name: Unsloth Model Inference
Author: cuba6112

bycuba6112

0•

数据科学与机器学习

Deploys and optimizes fine-tuned LLMs using native Unsloth kernels, vLLM, or SGLang for high-performance production serving.

The unsloth-inference skill provides specialized guidance for moving fine-tuned models from training into production environments. It automates the selection of optimized inference paths, including 2x faster native execution via specialized Triton kernels and production-grade serving through engines like vLLM and SGLang. By guiding users through weight merging strategies (16-bit or 4-bit) and OpenAI-compatible API setup, this skill ensures that Claude can efficiently help developers deploy high-throughput, low-latency AI endpoints with minimal manual configuration.

主要功能

010 GitHub stars

02Native kernel optimization for 2x faster local inference

03Automated LoRA weight merging for 16-bit and 4-bit formats

04Production-grade serving workflows for vLLM and SGLang

05GGUF export support for low-VRAM deployment via Ollama

06OpenAI-compatible API endpoint configuration and testing

使用场景

01Setting up drop-in OpenAI-compatible endpoints for legacy application integration

02Scaling fine-tuned models for high-concurrency production applications

03Converting LoRA adapters into merged standalone models for faster execution

What are Skills?·How to Install

Install with 🐟 Skill.Fish

npx skillfish add cuba6112/skillfactory unsloth-inference

For use in Claude.ai and ChatGPT

Download Skill