Does it handle complex document processing?

Yes, it provides specific rules for document vision, OCR strategies, and extracting structured data from charts, diagrams, and multi-page PDFs.

Can this skill help with character consistency in AI video?

Yes, it includes advanced multi-shot rules for Kling 3.0 character elements and identity binding to ensure visual consistency across different generated clips.

Which video generation models does this skill support?

The skill includes integration patterns and model selection guides for leading providers including Kling 3.0, Sora 2, Google Veo 3.1, and Runway Gen-4.5.

How does it improve AI audio implementations?

It offers patterns for speech-to-text with diarization, expressive text-to-speech, and low-latency voice agent configurations using models like Gemini Live and Whisper.

Multimodal LLM Integration

Name: Multimodal LLM Integration
Author: yonatangross


Data Science & ML

Integrates advanced vision, audio, and video generation capabilities into AI applications using industry-leading multimodal models and production-ready patterns.

The multimodal-llm skill provides a comprehensive framework for implementing vision, audio, and video features within Claude Code. It offers production-ready patterns for image analysis, document OCR, speech processing, and high-end video generation using providers like Kling, Sora, and Runway. Designed for complex AI pipelines, it helps developers navigate model selection, cost optimization, and sophisticated multi-shot video storyboarding while avoiding common pitfalls like improper image resizing or synchronous API polling errors.
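The image-resizing pitfall mentioned above usually comes down to sending an image that exceeds a model's input limit, or distorting its aspect ratio while shrinking it. The following is a minimal sketch of the dimension calculation only, not the skill's own code. The function name `fit_within` and the default long-edge cap of 1568 pixels are assumptions for illustration; the real limit depends on the provider and model.

```python
def fit_within(width: int, height: int, max_edge: int = 1568) -> tuple[int, int]:
    """Return dimensions scaled down so the longer edge is at most max_edge.

    The aspect ratio is preserved and images that already fit are never
    upscaled. max_edge=1568 is an assumed placeholder, not a documented limit.
    """
    if width <= 0 or height <= 0 or max_edge <= 0:
        raise ValueError("width, height, and max_edge must be positive")

    longest = max(width, height)
    if longest <= max_edge:
        # Already within bounds: leave the image untouched.
        return width, height

    scale = max_edge / longest
    # Round to whole pixels, keeping each side at least 1 pixel.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The returned pair can be passed to whichever image library the pipeline already uses for the actual resampling.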

Key Features

01 Model selection benchmarks for balancing cost, speed, and multimodal accuracy across providers

02 Visual analysis and document understanding patterns for OCR and complex chart extraction

03 Advanced multi-shot video techniques for maintaining character consistency and identity binding

04 116 GitHub stars

05 Comprehensive video generation workflows for Kling, Sora, Veo, and Runway APIs

06 Speech-to-text (STT) and text-to-speech (TTS) integration with speaker diarization and emotional cues
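Video generation jobs are typically long-running, so a workflow submits a job and then checks its status until it finishes. The sketch below shows one way to do that without blocking the event loop, using exponential backoff and an overall timeout. It is a generic pattern, not any provider's API. The `fetch_status` callback and the `"succeeded"` / `"failed"` status strings are hypothetical stand-ins for whatever the chosen provider's client returns.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical terminal states; real providers use their own status names.
TERMINAL_STATES = {"succeeded", "failed"}


async def wait_for_job(
    fetch_status: Callable[[], Awaitable[str]],
    *,
    initial_delay: float = 2.0,
    max_delay: float = 30.0,
    timeout: float = 600.0,
) -> str:
    """Poll fetch_status until it reports a terminal state or time runs out.

    The wait between polls doubles after each attempt, capped at max_delay,
    and asyncio.sleep yields control so other tasks keep running.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    delay = initial_delay

    while True:
        status = await fetch_status()
        if status in TERMINAL_STATES:
            return status
        if loop.time() + delay > deadline:
            raise TimeoutError(f"job still '{status}' after {timeout} seconds")
        await asyncio.sleep(delay)
        delay = min(delay * 2, max_delay)
```

In use, `fetch_status` would wrap a single status request for one job ID, and the caller would download the result only when the returned state is the success value.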

Use Cases

01 Building automated document processing pipelines with visual table extraction and PDF analysis

02 Generating professional AI-driven video content with multi-scene storyboarding and character elements

03 Developing real-time voice assistants with emotional intelligence and sub-second latency


Install with 🐟 Skill.Fish

npx skillfish add yonatangross/orchestkit multimodal-llm
