What is the maximum video length supported?

Videos of up to 6 hours can be processed at low resolution, or approximately 2 hours at default quality, using context windows of up to 2M tokens.

How is the API key managed?

The skill looks for GEMINI_API_KEY in your process environment, in a .env file at the project root, or in specific .claude configuration directories.
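A lookup like the one described above could be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the search order (environment first, then .env files), the function names `parse_env_file` and `find_api_key`, and the idea that each .claude directory holds its own .env file are all assumptions. The caller supplies the .claude directories, since the listing does not name them.

```python
import os
from pathlib import Path
from typing import Iterable, Mapping, Optional


def parse_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def find_api_key(
    project_root: Path,
    extra_dirs: Iterable[Path] = (),
    environ: Optional[Mapping[str, str]] = None,
    name: str = "GEMINI_API_KEY",
) -> Optional[str]:
    """Return the first key found, or None (search order is an assumption)."""
    env = os.environ if environ is None else environ
    if env.get(name):
        return env[name]
    # Hypothetical layout: a .env file in the project root and in each
    # extra directory (for example, a .claude configuration directory).
    candidates = [Path(project_root) / ".env"]
    candidates += [Path(d) / ".env" for d in extra_dirs]
    for candidate in candidates:
        if candidate.is_file():
            value = parse_env_file(candidate).get(name)
            if value:
                return value
    return None
```

Passing `environ` explicitly keeps the function easy to test. In normal use it can be called as `find_api_key(Path.cwd())`.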

Can I process YouTube videos directly?

Yes, the skill supports public YouTube URLs for analysis, summarization, and timestamped Q&A without needing to download the file manually.
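As a rough sketch of what passing a YouTube URL to Gemini can look like, the helper below builds a request body that references the video by URL instead of uploading a file. It assumes the Gemini API's `generateContent` request accepts a `file_data` part whose `file_uri` is the YouTube link. The helper name `build_youtube_request`, the host allow-list, and the placeholder video ID are hypothetical, and the skill's own scripts may work differently.

```python
from urllib.parse import urlparse

# Hosts treated as YouTube in this sketch (an illustrative allow-list).
YOUTUBE_HOSTS = {"www.youtube.com", "youtube.com", "m.youtube.com", "youtu.be"}


def build_youtube_request(url: str, prompt: str) -> dict:
    """Build a generateContent-style body that points at a public YouTube URL."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.hostname not in YOUTUBE_HOSTS:
        raise ValueError(f"not a YouTube URL: {url!r}")
    return {
        "contents": [
            {
                "parts": [
                    # Assumed request shape: the video is referenced by URL.
                    {"file_data": {"file_uri": url}},
                    {"text": prompt},
                ]
            }
        ]
    }


if __name__ == "__main__":
    import json

    body = build_youtube_request(
        "https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder ID
        "Summarize this video and list key moments with timestamps.",
    )
    print(json.dumps(body, indent=2))
```

The body would be sent with your API key to the model endpoint of your choice. Only public videos are expected to work, as noted above.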

Does it support structured data output?

Yes, it can extract data from visual sources like images and PDFs and return it in structured formats like JSON, CSV, or Markdown.
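To illustrate how the three output formats relate, the sketch below takes extracted records as JSON and renders the same data as CSV and as a Markdown table using only the standard library. The function names and the sample invoice rows are hypothetical, and the skill may instead ask the model for each format directly.

```python
import csv
import io
import json


def _field_names(records: list) -> list:
    """Collect column names across all records, in first-seen order."""
    fields = []
    for record in records:
        for key in record:
            if key not in fields:
                fields.append(key)
    return fields


def records_to_csv(records: list) -> str:
    """Render a list of flat dicts as CSV text with a header row."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=_field_names(records), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty cells
    return buffer.getvalue()


def records_to_markdown(records: list) -> str:
    """Render a list of flat dicts as a Markdown pipe table."""
    if not records:
        return ""
    fields = _field_names(records)
    lines = [
        "| " + " | ".join(fields) + " |",
        "| " + " | ".join("---" for _ in fields) + " |",
    ]
    for record in records:
        cells = [str(record.get(field, "")) for field in fields]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical rows, standing in for data extracted from a PDF table.
    extracted = json.loads('[{"item": "Widget", "qty": 2}, {"item": "Gadget", "qty": 5}]')
    print(records_to_csv(extracted))
    print(records_to_markdown(extracted))
```

Keeping JSON as the intermediate form means one extraction pass can feed several output formats.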

Which Gemini models does this skill support?

It supports Gemini 2.0 and 2.5 models in both Pro and Flash variants, including specialized models for image generation and lightweight models used for segmentation.

AI Multimodal Processing

Author: nordeim

Data Science & ML

Integrates Google Gemini's advanced multimodal capabilities to process, analyze, and generate audio, video, images, and documents directly within your development workflow.

AI Multimodal Processing is a Claude Code skill that connects text-based coding with rich media analysis. Using Gemini 2.0 and 2.5 models through the Google Gemini API, it lets developers automate tasks such as transcribing hours of audio, detecting scenes in videos, extracting structured data from multi-page PDFs, and generating images from text prompts. The skill provides a unified interface and Python scripts for handling diverse media formats, which makes it useful for developers building AI-powered features that need visual or audio understanding.

Key Features

01 High-fidelity image generation and editing with multiple aspect ratio support.

02 Native PDF vision processing for extracting structured data from tables and forms.

03 Object detection and pixel-level segmentation using Gemini 2.5 models.

04 Advanced video understanding including scene detection and YouTube URL support.

05 Comprehensive audio transcription and analysis for files up to 9.5 hours.

Use Cases

01 Automating the extraction of structured JSON data from complex financial PDF reports.

02 Generating descriptive alt-text and accessibility metadata for large image libraries.

03 Transcribing and summarizing long-form video content or meetings for documentation.


Install with 🐟 Skill.Fish

npx skillfish add nordeim/prompt-engineering ai-multimodal
