Overview
This skill provides a unified interface for multimedia understanding and generation, built on Google Gemini's multimodal capabilities. It lets developers transcribe long-form audio, analyze video with temporal understanding, extract structured data from complex PDF documents, and generate images from text prompts. It supports context windows of up to 2 million tokens, which makes it suitable for processing hours of footage or thousands of document pages, and it includes automated optimization scripts.
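
To make the 2-million-token figure concrete, the sketch below estimates how much media fits in one context window. The per-second token rates are placeholder assumptions chosen for illustration, not values taken from this skill or from Gemini's documentation, and the helper names (`estimate_tokens`, `fits_in_context`, `max_duration_seconds`) are hypothetical. Substitute the rates published for the model you actually use.

```python
import math

# Upper bound on the context window, as stated in the overview.
CONTEXT_WINDOW_TOKENS = 2_000_000

# ASSUMPTION: illustrative token costs per second of media.
# Real rates depend on the model and its sampling settings.
ASSUMED_TOKENS_PER_SECOND = {"video": 300, "audio": 32}


def estimate_tokens(kind: str, duration_seconds: float) -> int:
    """Estimate the token cost of a media file of the given kind and length."""
    return math.ceil(ASSUMED_TOKENS_PER_SECOND[kind] * duration_seconds)


def fits_in_context(kind: str, duration_seconds: float, reserved_tokens: int = 0) -> bool:
    """Check whether the media plus reserved prompt/output tokens fit in the window."""
    return estimate_tokens(kind, duration_seconds) + reserved_tokens <= CONTEXT_WINDOW_TOKENS


def max_duration_seconds(kind: str, reserved_tokens: int = 0) -> float:
    """Longest media duration that fits after reserving tokens for prompt and output."""
    return (CONTEXT_WINDOW_TOKENS - reserved_tokens) / ASSUMED_TOKENS_PER_SECOND[kind]


if __name__ == "__main__":
    for kind in ASSUMED_TOKENS_PER_SECOND:
        hours = max_duration_seconds(kind, reserved_tokens=10_000) / 3600
        print(f"{kind}: about {hours:.1f} hours under the assumed rate")
```

A pre-flight check like this can help decide whether a long recording should be split into chunks before upload. The estimate is only as good as the assumed rates.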