What is the maximum video length supported?

The skill supports processing videos up to 6 hours in low resolution or approximately 2 hours in default resolution using the Gemini 2.5 series models.
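As a rough illustration, the two limits quoted above can be turned into a pre-flight check. This is a minimal sketch, not part of the skill: the function name `fits_video_limit` is hypothetical, and "approximately 2 hours" is treated as a hard cap for simplicity.

```python
# Limits as stated in the answer above; names are hypothetical.
MAX_HOURS_LOW_RES = 6.0
MAX_HOURS_DEFAULT_RES = 2.0  # "approximately 2 hours", assumed to be a hard cap here


def fits_video_limit(duration_seconds: float, low_resolution: bool = False) -> bool:
    """Return True if a video of this duration is within the stated limit."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    max_hours = MAX_HOURS_LOW_RES if low_resolution else MAX_HOURS_DEFAULT_RES
    return duration_seconds <= max_hours * 3600
```

A three-hour recording, for example, would fail the check at default resolution but pass with `low_resolution=True`.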

Can I generate images with this skill?

Yes, it supports text-to-image generation, image editing, and composition using specific Gemini image-generation models.

Which document formats are supported for vision processing?

The skill is optimized for native PDF vision processing (up to 1,000 pages), but also includes scripts to convert DOCX, XLSX, and PPTX to PDF for analysis.
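The routing described above (PDF handled natively, Office formats converted first) could be sketched as follows. The function `plan_document` and its return values are assumptions made for illustration; the skill's actual conversion scripts are not shown in this listing.

```python
from pathlib import Path

# Formats the answer above says are converted to PDF before analysis.
CONVERT_TO_PDF = {".docx", ".xlsx", ".pptx"}
MAX_PDF_PAGES = 1000  # page limit quoted above


def plan_document(path: str) -> str:
    """Classify a file as 'native', 'convert', or 'unsupported' by extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "native"
    if suffix in CONVERT_TO_PDF:
        return "convert"
    return "unsupported"
```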

How does the skill handle large media files?

It includes a media_optimizer utility that compresses video and audio files or resizes images to meet API size limits while maintaining analysis quality.
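The listing does not show how media_optimizer works internally. As a hedged sketch of the image-resizing idea only, the helper below shrinks dimensions to fit a maximum side length while keeping the aspect ratio. The name `scale_to_fit` and the `max_side` parameter are hypothetical and are not the utility's real interface.

```python
def scale_to_fit(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Return dimensions whose longest side is at most max_side, preserving aspect ratio."""
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("width, height, and max_side must be positive")
    longest = max(width, height)
    if longest <= max_side:
        # Already small enough: leave the image untouched.
        return width, height
    scale = max_side / longest
    # Clamp to at least one pixel so extreme aspect ratios stay valid.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The resulting size would then be passed to whatever imaging library performs the actual resize.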

Does this skill require a separate API key?

Yes, you need a Google Gemini API key from Google AI Studio or Vertex AI, which must be configured in your environment variables or .env file.
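One way to read the key from either location is sketched below using only the standard library. The variable name `GEMINI_API_KEY` and the helper `load_api_key` are assumptions; check the skill's own documentation for the name it expects.

```python
import os
from pathlib import Path


def load_api_key(env_file: str = ".env", var_name: str = "GEMINI_API_KEY") -> str:
    """Return the key from the environment, falling back to a simple .env file."""
    value = os.environ.get(var_name)
    if value:
        return value
    path = Path(env_file)
    if path.is_file():
        for raw in path.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            # Skip blanks, comments, and lines that are not KEY=VALUE pairs.
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, val = line.partition("=")
            if name.strip() == var_name:
                val = val.strip().strip('"').strip("'")
                if val:
                    return val
    raise RuntimeError(f"{var_name} is not set in the environment or in {env_file}")
```

The environment variable takes precedence here, so a key exported in the shell overrides the one in the .env file.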

AI Multimodal Processing

Name: AI Multimodal Processing
Author: rafaelcalleja


Data Science & ML

Processes, analyzes, and generates audio, video, image, and document content using Google Gemini's powerful multimodal API.

This skill integrates Google Gemini's multimodal capabilities into Claude Code, allowing users to analyze large media assets, including video files up to 6 hours (in low resolution), audio recordings up to 9.5 hours, and PDF documents up to 1,000 pages. It provides a unified interface for tasks such as automated transcription with speaker identification, object detection, pixel-level segmentation, and image generation. It also automates media optimization and structured data extraction, which makes it useful for developers who build AI-driven media processing pipelines or extract insights from diverse file formats within their development environment.

Key Features

01 Unified processing for audio, video, images, and PDF documents

02 Native support for YouTube URLs and long-form video analysis

03 Structured data extraction from complex tables, forms, and charts

04 High-fidelity text-to-image generation and iterative editing

05 Automated transcription with timestamps and speaker identification

06 1 GitHub star

Use Cases

01 Automating video summarization and scene-based temporal analysis

02 Extracting structured JSON data from multi-page PDF reports and forms

03 Generating and refining marketing assets or UI mockups via text prompts


Install with 🐟 Skill.Fish

npx skillfish add rafaelcalleja/claude-market-place ai-multimodal

For use in Claude.ai and ChatGPT
