Does this skill require a Gemini API Key?

Yes, you must provide a GEMINI_API_KEY via environment variables or a .env file to access the processing capabilities.
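As an illustration of the two sources named above, here is a minimal sketch of resolving the key: it checks the environment first and falls back to a `.env` file. The function name `load_gemini_api_key` and the simple `KEY=value` parsing are assumptions made for this example, not the skill's actual loader.

```python
import os
from pathlib import Path


def load_gemini_api_key(env_file=".env"):
    """Return GEMINI_API_KEY from the environment, falling back to a .env file.

    Hypothetical helper for illustration; the skill's own scripts may differ.
    """
    key = os.environ.get("GEMINI_API_KEY")
    if key:
        return key

    path = Path(env_file)
    if path.is_file():
        for raw in path.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            if name.strip() == "GEMINI_API_KEY":
                # Drop optional surrounding quotes.
                return value.strip().strip("\"'")

    raise RuntimeError(
        "GEMINI_API_KEY is not set in the environment or in " + str(env_file)
    )
```

Checking the environment first means a key exported in the shell overrides whatever is stored in the file.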

Can I generate images with text using this skill?

Yes, the Gemini 3.0 series models used by this skill are optimized for rendering text within generated images and diagrams.

Is there a cost associated with using this skill?

The skill itself is open source, but Gemini API usage is subject to Google’s pricing and rate limits for its AI models.

Does it support processing PDF documents?

Absolutely. It features native PDF vision processing for up to 1,000 pages, capable of extracting tables, charts, and structured data.
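To make the PDF flow concrete, the sketch below builds the JSON body for a Gemini `generateContent` REST call with the PDF attached inline as base64. The endpoint shape and the `inline_data` field reflect the public Gemini REST API as understood here. The helper name `build_pdf_request`, the default model, and the prompt are assumptions for illustration, and the skill's own scripts may use an SDK or file upload instead.

```python
import base64
import json

# Assumed endpoint template for the Gemini REST API.
API_URL = (
    "https://generativelanguage.googleapis.com/v1beta/"
    "models/{model}:generateContent"
)


def build_pdf_request(pdf_bytes, prompt, model="gemini-2.5-flash"):
    """Return (url, json_body) for a request carrying a PDF and a text prompt.

    Hypothetical helper: it only builds the request and sends nothing.
    """
    body = {
        "contents": [
            {
                "parts": [
                    {
                        "inline_data": {
                            "mime_type": "application/pdf",
                            "data": base64.b64encode(pdf_bytes).decode("ascii"),
                        }
                    },
                    {"text": prompt},
                ]
            }
        ]
    }
    return API_URL.format(model=model), json.dumps(body)
```

Inline attachment suits small files only. Very large documents would more likely go through a separate file upload step, which this sketch does not cover.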

What is the maximum video length Claude can process with this skill?

The skill supports video processing for up to 6 hours at low resolution or 2 hours at default resolution using Gemini 2.5 models.
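The two limits above translate into a simple pre-flight check. The following sketch uses the hour figures from this answer. The resolution keys `"low"` and `"default"` and the function name are illustrative, not options defined by the skill.

```python
# Limits as stated in the FAQ answer above, in hours.
# The dictionary keys are illustrative labels, not the skill's option names.
MAX_VIDEO_HOURS = {"low": 6.0, "default": 2.0}


def fits_video_limit(duration_seconds, resolution="default"):
    """Return True if a video of this length is within the stated limit."""
    limit_seconds = MAX_VIDEO_HOURS[resolution] * 3600
    return duration_seconds <= limit_seconds
```

For example, a 5-hour recording passes the check only with `resolution="low"`.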

AI Multimodal Processing

Author: Devattom

Data Science and ML

Enables Claude to analyze, transcribe, and generate multimedia content including audio, images, videos, and documents through the Gemini API.

This skill integrates advanced multimodal capabilities into the Claude Code environment, allowing users to process a wide array of media formats using Google’s Gemini 2.x and 3.x models. It excels at complex tasks such as transcribing long-form audio, performing temporal video analysis, extracting structured data from multi-page PDFs, and generating high-quality images with text labels. Whether you need to automate video summarization, perform OCR on thousands of screenshots, or create architectural diagrams from text prompts, this skill provides a unified interface and optimized scripts for handling large-scale media processing directly within your development workflow.

Key Features

01 High-fidelity image generation and editing with support for multiple aspect ratios.

02 Advanced object detection and pixel-level segmentation using Gemini 2.5+ models.

03 Comprehensive audio transcription and analysis for files up to 9.5 hours.

04 Temporal video understanding and scene detection with YouTube URL support.

05 Native PDF vision processing for structured data extraction from forms and tables.

GitHub stars: 0

Use Cases

01 Generating UI mockups or architectural diagrams directly from text descriptions during the design phase.

02 Automated video content summarization and timestamped transcription for technical tutorials.

03 Extracting structured JSON data from complex multi-page financial or technical PDF documents.

Install with 🐟 Skill.Fish

npx skillfish add devattom/.claude ai-multimodal

For use in Claude.ai and ChatGPT