What models are supported for media generation?

The skill supports the Imagen 4.0 family for high-quality text-to-image generation and Veo 3.1 for generating 8-second video clips with native audio.

How does this skill differ from Claude's built-in vision?

This skill uses Google Gemini models, which add specialized capabilities such as temporal video analysis and object detection, along with context windows of up to 2 million tokens for files that exceed the default limits.

What are the setup requirements?

You need a Google Gemini API key from AI Studio, and the skill requires Python libraries like google-genai, pillow, and python-dotenv.
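As a rough pre-flight check for these requirements, the sketch below verifies that an API key is present and that the three libraries can be located. The environment variable name `GEMINI_API_KEY` and the helpers `is_importable` and `check_setup` are assumptions made for illustration; they are not taken from the skill itself.

```python
import importlib.util
import os

# pip package name -> import name (the two differ for all three libraries)
REQUIRED_MODULES = {
    "google-genai": "google.genai",
    "pillow": "PIL",
    "python-dotenv": "dotenv",
}


def is_importable(module_name):
    """Return True if the module can be located.

    For dotted names, find_spec imports the parent package to resolve the
    child, so a missing parent surfaces as an ImportError.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def check_setup(env=None, key_name="GEMINI_API_KEY"):
    """Return a list of human-readable problems; an empty list means ready.

    `key_name` is an assumed variable name; adjust it to match your setup.
    """
    env = os.environ if env is None else env
    problems = []
    if not env.get(key_name, "").strip():
        problems.append(f"missing environment variable {key_name}")
    for package, module in REQUIRED_MODULES.items():
        if not is_importable(module):
            problems.append(f"missing package {package} (pip install {package})")
    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print("setup problem:", problem)
```

Running the file directly prints one line per missing requirement and nothing when the environment looks ready.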

Can it handle long audio and video files?

Yes, it includes scripts to process audio up to 9.5 hours and video up to 6 hours, including automated chunking to ensure full transcripts without API truncation.
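The chunking idea can be illustrated with a small planner that splits a long recording into time ranges, each of which could then be transcribed separately and stitched back together. This is only a sketch: the function name `plan_chunks`, the 15-minute chunk length, and the 5-second overlap are assumptions, and the skill's own scripts may choose boundaries differently.

```python
def plan_chunks(total_seconds, chunk_seconds=900, overlap_seconds=5):
    """Split [0, total_seconds] into (start, end) ranges.

    Consecutive ranges overlap by `overlap_seconds` so that words spoken
    across a boundary appear in both chunks and can be de-duplicated later.
    """
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, chunk_seconds)")

    chunks = []
    start = 0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        chunks.append((start, end))
        if end == total_seconds:
            break
        # Step back by the overlap so the next chunk re-covers the boundary
        start = end - overlap_seconds
    return chunks


if __name__ == "__main__":
    for start, end in plan_chunks(2000):
        print(f"{start:>5}s -> {end:>5}s")
```

With no overlap, a 9.5-hour recording (34,200 seconds) splits into 38 chunks of 15 minutes each under these assumed settings.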

AI Multimodal Content & Analysis

Name: AI Multimodal Content & Analysis
Author: CongDon1207


Data Science & Machine Learning

Integrates Google Gemini's advanced multimodal capabilities to process, analyze, and generate professional audio, image, and video content directly within your workflow.

This skill gives Claude access to the Google Gemini API suite for image reasoning, long-form audio transcription, video scene analysis, and high-fidelity media generation. It works around the limits of standard vision by calling specialized models such as Imagen 4 for text-to-image and Veo 3 for text-to-video. Whether you are extracting structured data from complex PDFs, generating marketing assets from text prompts, or performing temporal analysis on hours of video footage, the skill provides the scripts and reference patterns for working with context windows of up to 2 million tokens, with built-in API key rotation and media optimization.
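API key rotation is commonly done round-robin: each request takes the next key, and a rate-limited key is skipped in favor of the following one. The sketch below shows that pattern in isolation. `KeyRotator` and `RateLimited` are hypothetical names, and the callable passed to `call` stands in for whatever function actually sends the request; none of this is taken from the skill's source.

```python
import itertools


class RateLimited(Exception):
    """Hypothetical error a request function raises when a key hits its quota."""


class KeyRotator:
    """Cycle through API keys, moving on to the next key on a quota error."""

    def __init__(self, keys):
        self._keys = list(keys)
        if not self._keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(self._keys)

    def call(self, func):
        """Run func(key), trying each key at most once per call."""
        last_error = None
        for _ in range(len(self._keys)):
            key = next(self._cycle)
            try:
                return func(key)
            except RateLimited as exc:
                last_error = exc  # remember the failure, try the next key
        raise last_error


if __name__ == "__main__":
    def fake_request(key):
        # Stand-in for a real API call; pretend the first key is exhausted
        if key == "key-1":
            raise RateLimited("quota exceeded")
        return f"response via {key}"

    rotator = KeyRotator(["key-1", "key-2", "key-3"])
    print(rotator.call(fake_request))
```

Because the cycle persists between calls, load spreads across the keys over time instead of always starting from the first one.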

Key Features

01 Professional audio transcription with speaker timestamps and music/sound analysis.

02 Advanced OCR and structured data extraction from multi-page PDF documents and forms.

03 Smart API key rotation and media optimization for high-volume batch processing.

04 High-fidelity image and video analysis using Gemini 2.5/3 models with 2M token context.

05 Text-to-image and text-to-video generation using Imagen 4 and Veo 3.

06 1 GitHub star

Use Cases

01 Generating high-quality visual assets and 8-second video clips for production from simple text prompts.

02 Analyzing complex technical diagrams and extracting structured JSON data from multi-page business reports.

03 Transcribing and summarizing long-form podcasts or meeting recordings with accurate timestamps and metadata.


Install with 🐟 Skill.Fish

npx skillfish add congdon1207/agents.md ai-multimodal

For use in Claude.ai and ChatGPT
