Introduction
AI Multimodal provides a unified interface to Google Gemini for analyzing and creating multimedia content. On the analysis side, it covers audio transcription with timestamps, image understanding (including object detection and OCR), video scene analysis, and structured data extraction from complex PDF documents. On the generation side, it supports text-to-image generation and iterative editing. It is aimed at developers and content creators who need to integrate multimodal AI features into their workflows, and it supports context windows of up to 2 million tokens.
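The skill's own interface is not shown in this section, so the following is only a minimal sketch of the kind of request a multimodal call to Gemini involves: a text instruction paired with inline media. The helper name `build_gemini_request` is hypothetical, and the body layout (`contents` / `parts` / `inline_data`) is assumed to follow the Gemini REST `generateContent` request shape. No network call is made.

```python
import base64
import json


# Hypothetical helper for illustration; it is not part of AI Multimodal.
def build_gemini_request(prompt: str, media_bytes: bytes, mime_type: str) -> dict:
    """Build a request body with one text part and one inline media part.

    Assumption: the layout mirrors the Gemini REST generateContent body,
    where binary media is sent base64-encoded alongside its MIME type.
    """
    encoded = base64.b64encode(media_bytes).decode("ascii")
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {"inline_data": {"mime_type": mime_type, "data": encoded}},
                ]
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real audio file.
    body = build_gemini_request(
        "Transcribe this audio with timestamps.",
        b"placeholder-audio-bytes",
        "audio/mp3",
    )
    print(json.dumps(body, indent=2))
```

The same shape would apply to images, video, or PDFs by changing the MIME type and prompt. Large files are usually better uploaded separately than inlined, which this sketch does not cover.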