Which Gemini models are supported?

The skill supports the Gemini 2.0 and 2.5 series, including Flash for speed, Pro for high-quality reasoning, and Flash-Image for image generation.

What is the maximum file size supported?

The skill supports files up to 2GB via the Gemini File API, with a 20GB project quota and 48-hour storage retention.
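The limits above can be checked locally before an upload is attempted. The following is a minimal sketch using only the Python standard library; the function names `can_upload` and `expires_at` are hypothetical, not part of the skill, and reading "GB" as 1024^3 bytes is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Limits quoted in the answer above. Treating "GB" as 1024**3 bytes is an
# assumption; the listing does not say whether it means binary or decimal units.
MAX_FILE_BYTES = 2 * 1024**3
PROJECT_QUOTA_BYTES = 20 * 1024**3
RETENTION = timedelta(hours=48)


def can_upload(file_bytes: int, project_bytes_used: int) -> bool:
    """Return True if the file fits the per-file limit and the remaining quota.

    Hypothetical helper for illustration; it does not call any Gemini API.
    """
    if file_bytes < 0 or project_bytes_used < 0:
        raise ValueError("sizes must be non-negative")
    fits_file_limit = file_bytes <= MAX_FILE_BYTES
    fits_quota = project_bytes_used + file_bytes <= PROJECT_QUOTA_BYTES
    return fits_file_limit and fits_quota


def expires_at(uploaded_at: datetime) -> datetime:
    """Return the time at which an uploaded file leaves 48-hour storage."""
    return uploaded_at + RETENTION


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(can_upload(500 * 1024**2, project_bytes_used=0))
    print(expires_at(now).isoformat())
```

A check like this only avoids obviously oversized uploads; the actual quota in use would still have to come from the service itself.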

Can I process YouTube videos directly?

Yes, the skill supports public YouTube URLs for video analysis and temporal understanding tasks.

Does this skill require a Google Gemini API key?

Yes, you need a valid API key from Google AI Studio or Vertex AI configured in your environment variables to use this skill.
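As a rough illustration of the environment-variable setup described above, the sketch below reads a key and fails early if it is missing. The variable name `GEMINI_API_KEY` and the helper `load_api_key` are assumptions for this example; the listing does not state which variable name the skill actually reads.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None,
                 var_name: str = "GEMINI_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is not set.

    Both the default variable name and this helper are hypothetical.
    """
    source = os.environ if env is None else env
    key = source.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; create a key in Google AI Studio or "
            "Vertex AI and export it before using the skill."
        )
    return key


if __name__ == "__main__":
    try:
        load_api_key()
        print("API key found")
    except RuntimeError as err:
        print(err)
```

Passing a mapping explicitly, as in `load_api_key({"GEMINI_API_KEY": "..."})`, keeps the check easy to test without touching the real environment.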

Does it support OCR for multi-page documents?

Yes, it features native PDF vision processing that can handle up to 1,000 pages for text extraction, table parsing, and diagram analysis.

AI Multimodal Processing

Name: AI Multimodal Processing
Author: wollfoo


Data Science and ML

Integrates Google Gemini's multimodal capabilities to process audio, video, images, and documents directly within the Claude Code environment.

This skill provides a unified interface for leveraging Google Gemini 2.0 and 2.5 models to analyze and generate multimedia content. It enables Claude to perform complex tasks such as transcribing long-form audio (up to 9.5 hours), detecting objects in videos, extracting structured data from multi-page PDFs, and generating high-quality images from text prompts. By supporting large context windows of up to 2M tokens and providing specialized scripts for media optimization, this tool is essential for developers building multimodal AI features or requiring deep analysis of non-text assets within their development workflow.

Key Features

01 High-quality image generation and editing using specialized Gemini models

02 Long-form media support with up to 2M token context windows

03 Advanced OCR and structured data extraction from complex forms and tables

04 Automated media optimization and batch processing utility scripts

05 Multimodal analysis for audio, video, images, and PDF documents


Use Cases

01 Automating data entry by extracting JSON from scanned PDF documents and charts

02 Transcribing and summarizing long technical meetings or video tutorials for documentation

03 Building and testing AI-driven vision or image generation features in applications


Install with 🐟 Skill.Fish

npx skillfish add wollfoo/amp ai-multimodal
