Does this skill require a separate API key?

Yes, you need a Google AI Studio or Vertex AI API key configured as GEMINI_API_KEY in your environment variables or .env file.
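As a rough illustration of the lookup order described above, the sketch below reads GEMINI_API_KEY from the process environment first and falls back to a .env file. The function name load_gemini_api_key and the minimal .env parsing are assumptions made for this example, not the skill's actual code.

```python
import os
from pathlib import Path


def load_gemini_api_key(env_path=".env"):
    """Return GEMINI_API_KEY from the environment, else from a .env file.

    Hypothetical helper: the skill itself may resolve the key differently.
    """
    key = os.environ.get("GEMINI_API_KEY")
    if key:
        return key

    path = Path(env_path)
    if path.is_file():
        for raw in path.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            # Skip blank lines, comments, and lines without KEY=VALUE shape.
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            if name.strip() == "GEMINI_API_KEY":
                # Drop optional surrounding quotes; empty values count as missing.
                return value.strip().strip("'\"") or None
    return None
```

Calling load_gemini_api_key() returns None when no key is configured, so a caller can fail early with a clear message instead of sending an unauthenticated request.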

Which Gemini models are supported?

The skill supports Gemini 2.0 and 2.5 series, including Flash, Pro, and Flash-Lite models to balance speed and accuracy.

What media formats are supported?

It supports major audio (MP3, WAV, AAC), image (PNG, JPEG, WEBP), video (MP4, MOV, AVI), and PDF formats for vision processing.
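The format list above can be expressed as a small lookup table. This is only a sketch: the SUPPORTED mapping and media_category function are hypothetical names, the extensions are inferred from the format names listed here (with .jpg assumed as an alias for JPEG), and the skill may accept more formats than this answer mentions.

```python
from pathlib import Path

# Extensions inferred from the formats named in the FAQ answer (assumption).
SUPPORTED = {
    "audio": {".mp3", ".wav", ".aac"},
    "image": {".png", ".jpg", ".jpeg", ".webp"},
    "video": {".mp4", ".mov", ".avi"},
    "document": {".pdf"},
}


def media_category(filename):
    """Return 'audio', 'image', 'video', 'document', or None if unlisted."""
    ext = Path(filename).suffix.lower()
    for category, extensions in SUPPORTED.items():
        if ext in extensions:
            return category
    return None
```

A check like this is handy before uploading, since it rejects an unlisted file type locally rather than after a round trip to the API.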

Can it process YouTube videos?

Yes, it supports analysis of public YouTube URLs, providing temporal understanding, scene detection, and summarization.
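Before handing a link to the skill, it can help to confirm that it has the shape of a YouTube video URL. The sketch below uses only the Python standard library; looks_like_youtube_url and the YOUTUBE_HOSTS set are hypothetical names, and the check cannot tell whether a video is actually public.

```python
from urllib.parse import parse_qs, urlparse

# Hostnames treated as YouTube for this sketch (assumption, not exhaustive).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def looks_like_youtube_url(url):
    """Return True if url has the shape of a YouTube watch or short link."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    if host not in YOUTUBE_HOSTS:
        return False
    if host == "youtu.be":
        # Short links carry the video ID in the path.
        return bool(parsed.path.strip("/"))
    # Standard links carry the video ID in the "v" query parameter.
    return parsed.path == "/watch" and bool(parse_qs(parsed.query).get("v"))
```

This only validates the URL format; private, age-restricted, or removed videos would still pass the check and fail later.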

Is there a limit on file size or length?

The skill supports files up to 2 GB via the Gemini File API. Length limits also apply: up to 9.5 hours for audio and up to 6 hours for video.
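The limits quoted above can be checked locally before an upload. This is a minimal sketch, assuming that 2 GB means 2 × 1024³ bytes and that the caller already knows the media duration; check_limits and the constant names are hypothetical and not part of the skill.

```python
# Limits taken from the FAQ answer; the byte interpretation of "2 GB" is an assumption.
MAX_FILE_BYTES = 2 * 1024**3
MAX_DURATION_HOURS = {"audio": 9.5, "video": 6.0}


def check_limits(category, size_bytes, duration_hours=None):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_BYTES}")
    limit = MAX_DURATION_HOURS.get(category)
    # Duration limits only apply to audio and video, and only when known.
    if limit is not None and duration_hours is not None and duration_hours > limit:
        problems.append(f"{category} is {duration_hours} h long, limit is {limit} h")
    return problems
```

For example, check_limits("video", 3 * 1024**3, 7) reports both a size and a duration problem, while a small image with no duration passes with an empty list.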

AI Multimodal Processing

Author: GGPrompts

Data Science & ML

Integrates Google Gemini's multimodal capabilities to process audio, video, images, and documents directly within your development workflow.

This skill lets Claude work with multimedia assets through the Google Gemini API (including versions 1.5, 2.0, and 2.5). It provides a single interface for long-form video analysis, timestamped audio transcription, high-accuracy OCR, and native PDF vision processing. Beyond analysis, it supports image generation and editing, which makes it useful for developers who build AI-driven features or need to extract structured data from mixed media formats in their terminal environment.

Key Features

01 Visual understanding, including object detection, pixel-level segmentation, and OCR

02 Structured data extraction from complex PDF documents, tables, and charts

03 Comprehensive audio transcription and analysis for files up to 9.5 hours

04 Integrated text-to-image generation and refinement using Gemini models

05 Advanced video processing with scene detection and YouTube URL support

GitHub stars: 1

Use Cases

01 Automating the extraction of structured JSON data from multi-page PDF reports and invoices

02 Analyzing long-form video content to create automated summaries or time-stamped logs

03 Generating code or documentation based on visual screenshots and UI design mockups

Install with 🐟 Skill.Fish

npx skillfish add ggprompts/my-plugins ai-multimodal