Does this skill support generating images with text?

Yes. The skill uses the gemini-3-pro-image-preview model, which renders legible text inside generated images, making it well suited to diagrams and infographics.

Which Google Gemini models are supported by this skill?

The skill supports Gemini 2.0, Gemini 2.5 (Pro, Flash, and Flash-Lite), and the Gemini 3.0 series, which it uses for high-fidelity image generation and preview tasks.

Can I process YouTube videos directly with this skill?

Yes, the skill supports public YouTube URLs for tasks like scene detection, summarization, and temporal Q&A without needing to download the file manually.
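
As a rough illustration of what passing a YouTube URL without downloading can look like, the sketch below builds a request body in which the URL is passed as a file reference next to a text prompt. The payload shape follows the Gemini API's generateContent REST format as commonly documented; the helper names (is_youtube_url, build_youtube_request) and the accepted host list are assumptions made for this example and are not taken from the skill's source.

```python
import json
from urllib.parse import urlparse

# Hosts treated as YouTube for this sketch (an assumption, not an official list).
YOUTUBE_HOSTS = {"www.youtube.com", "youtube.com", "m.youtube.com", "youtu.be"}


def is_youtube_url(url):
    """Return True if the URL is http(s) and points at a known YouTube host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in YOUTUBE_HOSTS


def build_youtube_request(url, prompt):
    """Build a generateContent-style body: the video URL part, then the prompt."""
    if not is_youtube_url(url):
        raise ValueError(f"not a YouTube URL: {url}")
    return {
        "contents": [
            {
                "parts": [
                    {"file_data": {"file_uri": url}},  # video referenced by URL
                    {"text": prompt},                  # the question or task
                ]
            }
        ]
    }


if __name__ == "__main__":
    body = build_youtube_request(
        "https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder video ID
        "Summarize this video and list the main scene changes with timestamps.",
    )
    print(json.dumps(body, indent=2))
```

The sketch only constructs the body; sending it still requires an authenticated call to the API, and the video must be public.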

How do I configure my credentials for this skill?

The skill looks for a GEMINI_API_KEY in your environment variables or a .env file. It also supports Vertex AI configurations using your GCP Project ID.
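
The lookup described above can be sketched with the standard library alone. This is a minimal illustration, not the skill's actual code: the function names (parse_env_file, resolve_api_key) are hypothetical, and the precedence shown (environment variable first, then .env) is an assumption, since the listing does not state which source wins.

```python
import os
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    file = Path(path)
    if not file.is_file():
        return values
    for raw in file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(env=None, env_file=".env"):
    """Return GEMINI_API_KEY from the environment, falling back to a .env file.

    The environment-first order is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY") or parse_env_file(env_file).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY not found in environment or .env file")
    return key
```

Calling resolve_api_key() with no arguments reads the real process environment and a .env file in the current directory; passing env and env_file explicitly makes the lookup easy to exercise in isolation.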

What are the file size and length limits for media processing?

The skill supports audio up to 9.5 hours, video up to 6 hours, and PDFs up to 1,000 pages. Individual files can be up to 2 GB when uploaded through the Google File API.
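
A pre-flight check against these limits might look like the sketch below. The numbers come from the answer above; the function name check_media, its parameters, and the reading of 2 GB as 2 × 1024³ bytes are assumptions made for illustration.

```python
# Limits quoted in this FAQ. Treating "2 GB" as 2 GiB is an assumption.
MAX_FILE_BYTES = 2 * 1024**3
MAX_HOURS = {"audio": 9.5, "video": 6.0}
MAX_PDF_PAGES = 1000


def check_media(kind, size_bytes, duration_hours=None, pages=None):
    """Return a list of limit violations; an empty list means the file fits."""
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_BYTES}")
    if kind in MAX_HOURS and duration_hours is not None:
        if duration_hours > MAX_HOURS[kind]:
            problems.append(
                f"{kind} runs {duration_hours} h, limit is {MAX_HOURS[kind]} h"
            )
    if kind == "pdf" and pages is not None and pages > MAX_PDF_PAGES:
        problems.append(f"pdf has {pages} pages, limit is {MAX_PDF_PAGES}")
    return problems


if __name__ == "__main__":
    # A 7-hour video exceeds the 6-hour limit even though its size is fine.
    for issue in check_media("video", 500 * 1024**2, duration_hours=7):
        print(issue)
```

Returning a list rather than raising on the first failure lets a caller report every violated limit at once.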

AI Multimodal Processing

Name: AI Multimodal Processing
Author: flosrn

Data Science & ML

Integrates Google Gemini's multimodal capabilities to process, analyze, and generate audio, video, images, and documents within Claude Code.

The AI Multimodal Processing skill leverages the Google Gemini API to provide a unified interface for complex multimedia tasks directly from your development environment. It enables high-fidelity audio transcription, deep video analysis with scene detection, structured data extraction from multi-page PDFs, and advanced image generation. By supporting the latest Gemini 2.5 and 3.0 models, it handles context windows up to 2 million tokens, making it ideal for processing long-form media, building automated content pipelines, and performing visual Q&A on complex technical documentation.

Key Features

01 Comprehensive audio transcription and speaker identification for files up to 9.5 hours.

02 0 GitHub stars

03 Advanced video analysis including scene detection, temporal Q&A, and YouTube URL support.

04 High-quality text-to-image generation and editing with precise control over aspect ratios and styles.

05 Vision-based OCR, object detection, and pixel-level segmentation using Gemini 2.5+ models.

06 Intelligent PDF extraction that converts tables, forms, and charts into structured JSON data.

Use Cases

01 Automating the extraction of structured data from complex business documents and technical manuals.

02 Generating captions, summaries, and searchable metadata for large-scale video and audio libraries.

03 Creating technical architecture diagrams and visual assets directly from text-based prompts.

Install with 🐟 Skill.Fish

npx skillfish add flosrn/.claude ai-multimodal

For use in Claude.ai and ChatGPT