What file formats does this skill support?

It supports a wide range of formats including MP4 and MOV for video; WAV, MP3, and AAC for audio; PNG, JPEG, and WEBP for images; and PDF for native document vision processing.
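
As a rough illustration of the format list above, the sketch below sorts a filename into one of the four media categories by its extension. The `SUPPORTED_FORMATS` table and the `media_category` helper are hypothetical names introduced here, not part of the skill, and treating `.jpg` as a JPEG extension is an assumption.

```python
from pathlib import Path
from typing import Optional

# Formats named in the answer above. The table and helper are hypothetical,
# written for illustration only; ".jpg" is assumed to be an accepted JPEG suffix.
SUPPORTED_FORMATS = {
    "video": {".mp4", ".mov"},
    "audio": {".wav", ".mp3", ".aac"},
    "image": {".png", ".jpg", ".jpeg", ".webp"},
    "document": {".pdf"},
}


def media_category(filename: str) -> Optional[str]:
    """Return the media category for a filename, or None if its format is not listed."""
    suffix = Path(filename).suffix.lower()
    for category, suffixes in SUPPORTED_FORMATS.items():
        if suffix in suffixes:
            return category
    return None
```

A caller could use `media_category("report.pdf")` to decide which kind of request to build, and treat a `None` result as an unsupported file.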

What is the maximum video length Claude can process with this skill?

Using Gemini 2.5 models, this skill can process up to 6 hours of low-resolution video or approximately 2 hours at default resolution.
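
The limits above can be turned into a simple pre-flight check before a long video is submitted. This is a minimal sketch that takes the stated figures at face value (the 2-hour figure is approximate); `MAX_HOURS` and `fits_video_limit` are hypothetical names, and the skill itself may enforce limits differently.

```python
# Limits as stated in the answer above for Gemini 2.5 models.
# The "default" value is approximate; both names here are hypothetical.
MAX_HOURS = {"low": 6.0, "default": 2.0}


def fits_video_limit(duration_seconds: float, resolution: str = "default") -> bool:
    """Return True if a video of this duration is within the stated limit."""
    if resolution not in MAX_HOURS:
        raise ValueError(f"unknown resolution: {resolution!r}")
    return duration_seconds <= MAX_HOURS[resolution] * 3600
```

Under these assumptions, a three-hour recording passes only when processed at low resolution.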

How does it handle large documents like multi-page PDFs?

It uses native PDF vision processing to extract tables, forms, and diagrams from documents up to 1,000 pages long, providing structured JSON output.

Can this skill generate images from text?

Yes, it includes image generation capabilities using the Gemini 2.5 Flash-Image model, supporting various aspect ratios and iterative editing.

Do I need a Google Gemini API key to use this?

Yes, you need a Gemini API key from Google AI Studio or Vertex AI. The skill supports configuration via environment variables or .env files.
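
The two configuration routes mentioned above can be sketched with the standard library alone. The variable name `GEMINI_API_KEY`, the helper names, and the rule that an environment variable takes precedence over the .env file are all assumptions made for illustration; the skill's own lookup may differ.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def read_dotenv(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env file; a missing file yields {}."""
    values: Dict[str, str] = {}
    env_file = Path(path)
    if not env_file.is_file():
        return values
    for raw in env_file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(name: str = "GEMINI_API_KEY",
                    dotenv_path: str = ".env") -> Optional[str]:
    """Return the key from the environment, else from the .env file, else None.

    Both the variable name and the precedence order are assumptions.
    """
    return os.environ.get(name) or read_dotenv(dotenv_path).get(name)
```

Checking the environment first lets a deployment override a developer's local .env file without editing it.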

AI Multimodal Processing

Name: AI Multimodal Processing
Author: GGPrompts

Category: Data Science and Machine Learning

Processes and generates audio, video, images, and documents using the Google Gemini API to enable advanced media understanding and creation within Claude.

The AI Multimodal Processing skill integrates Google Gemini's advanced capabilities into Claude, allowing for sophisticated analysis of multimedia assets. It provides a unified interface for transcribing hours of audio, performing scene detection on long-form video, extracting structured data from multi-page PDFs, and generating high-quality images from text. This skill is particularly useful for developers who need to automate complex media workflows, perform high-fidelity OCR, or implement pixel-level image segmentation and object detection directly within their development environment.

Key Features

01 High-accuracy audio transcription with speaker identification and timestamps.

02 Comprehensive vision capabilities including object detection and pixel-level segmentation.

03 Deep video analysis and scene detection for videos up to 6 hours long at low resolution, or approximately 2 hours at default resolution.

04 Advanced document extraction for tables, forms, and charts from multi-page PDFs.

05 Text-to-image generation and editing with support for multiple aspect ratios.

GitHub stars: 2

Use Cases

01 Building automated content tagging systems for large-scale image and video libraries.

02 Automating data extraction and summarization from large collections of meeting recordings and legal documents.

03 Generating and refining visual assets or UI components using iterative text-to-image prompts.

Install with 🐟 Skill.Fish

npx skillfish add ggprompts/my-gg-plugins ai-multimodal
