What are the maximum media length limits?

You can process audio files up to 9.5 hours and video files up to 6 hours (at low resolution) or 2 hours (at default resolution).
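As a rough illustration, the limits stated above could be checked before a file is submitted. This is a minimal sketch, not part of the skill: the `MAX_HOURS` table and the `within_limit` helper are hypothetical names, and the numbers are copied from the answer above rather than queried from the API.

```python
# Hypothetical pre-flight check. Limits (in hours) are taken from the FAQ
# answer above; they are not fetched from the Gemini API.
MAX_HOURS = {
    ("audio", "default"): 9.5,
    ("video", "low"): 6.0,
    ("video", "default"): 2.0,
}


def within_limit(kind: str, duration_hours: float, resolution: str = "default") -> bool:
    """Return True if a file of this duration fits the stated limit."""
    try:
        limit = MAX_HOURS[(kind, resolution)]
    except KeyError:
        raise ValueError(f"no stated limit for {kind!r} at {resolution!r} resolution")
    return 0 <= duration_hours <= limit
```

For example, `within_limit("video", 3)` is false at default resolution, while `within_limit("video", 3, "low")` is true.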

Can I extract data from scanned PDF documents?

Yes, the skill uses native PDF vision processing to extract text, tables, and charts from documents up to 1,000 pages.

Does this skill require a Google API key?

Yes, you must provide a GEMINI_API_KEY from Google AI Studio or configure Vertex AI credentials to use these multimodal features.
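One way a script might fail early when the key is missing is sketched below. It assumes the key is supplied through an environment variable named GEMINI_API_KEY, as the answer above suggests; the `resolve_gemini_api_key` helper is a hypothetical name, and the Vertex AI path is deliberately not covered.

```python
import os


def resolve_gemini_api_key(env=None) -> str:
    """Return the API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to test. Only the GEMINI_API_KEY path is handled here.
    """
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; create a key in Google AI Studio "
            "or configure Vertex AI credentials instead."
        )
    return key
```

Calling this once at startup gives a readable error instead of a failed request later in the workflow.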

How does the image generation feature work?

It uses the Gemini 2.5 Flash Image model to generate images from text prompts, with control over style and aspect ratio and support for iterative refinement.

Which file formats are supported by this skill?

The skill supports a wide range of formats including MP3, WAV, AAC (audio), PNG, JPEG, WEBP (images), MP4, MOV, AVI (video), and PDF (documents).
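The format list above can be turned into a simple lookup for routing files by type. This is a sketch under assumptions: the `SUPPORTED` table and the `media_category` helper are hypothetical, the `.jpg` alias for JPEG is an addition not named in the answer, and since the answer says "including", the real list may be longer.

```python
from pathlib import Path
from typing import Optional

# Extensions named in the FAQ answer above. ".jpg" is an assumed alias for
# JPEG; the skill may accept further formats that are not listed here.
SUPPORTED = {
    "audio": {".mp3", ".wav", ".aac"},
    "image": {".png", ".jpeg", ".jpg", ".webp"},
    "video": {".mp4", ".mov", ".avi"},
    "document": {".pdf"},
}


def media_category(path: str) -> Optional[str]:
    """Return the media category for a file name, or None if it is not listed."""
    suffix = Path(path).suffix.lower()
    for category, extensions in SUPPORTED.items():
        if suffix in extensions:
            return category
    return None
```

The lookup is case-insensitive, so `clip.MP4` is treated as video; an unlisted extension such as `.txt` yields `None`.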

AI Multimodal Processing

Name: AI Multimodal Processing
Author: Activer007

Data Science & ML

Processes, analyzes, and generates multimedia content including audio, video, images, and complex documents using the Google Gemini API.

This skill integrates Google Gemini's advanced multimodal capabilities directly into Claude, enabling deep analysis of audio files up to 9.5 hours, video processing up to 6 hours, and complex data extraction from multi-page PDFs. It provides a unified interface for diverse tasks such as timestamped transcription, object detection, visual Q&A, and high-fidelity text-to-image generation. Whether you are automating data entry from scanned forms or building automated video summarization pipelines, this skill provides the necessary patterns and scripts to handle sophisticated media processing workflows.

Key Features

01 Comprehensive audio transcription and speaker identification for files up to 9.5 hours

02 Intelligent image analysis featuring object detection, OCR, and pixel-level segmentation

03 Native PDF vision processing for extracting tables, forms, and charts into structured JSON

04 Advanced video understanding including scene detection, temporal Q&A, and YouTube support

05 High-quality text-to-image generation with support for multiple aspect ratios and iterative editing

Use Cases

01 Extracting structured JSON data from complex multi-page financial reports and PDF documents

02 Automating visual asset creation and image refinement for frontend development workflows

03 Generating searchable summaries and timestamped transcriptions for long-form video archives

Install with 🐟 Skill.Fish

npx skillfish add activer007/ordinary-claude-skills ai-multimodal

For use in Claude.ai and ChatGPT