Does this skill require a separate API key?

Yes. You must provide a GEMINI_API_KEY, obtained from Google AI Studio, to access the multimodal processing and generation features.
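
As a rough illustration, a script could check for the key before doing any work. This is a minimal sketch, not the skill's actual code: it assumes the key is exposed as an environment variable named GEMINI_API_KEY, and the helper name load_gemini_api_key is hypothetical.

```python
import os


def load_gemini_api_key(env=None):
    """Return the Gemini API key, or raise if it is missing.

    Assumption: the key is stored in an environment variable called
    GEMINI_API_KEY. `env` can be any mapping, which makes the helper
    easy to test without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; create a key in Google AI Studio "
            "and export it before running the skill."
        )
    return key
```

Failing early with a clear message is usually friendlier than letting the first API request fail with an authentication error.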

How does this differ from Claude's default vision capabilities?

This skill uses Google Gemini models, which offer enhanced object detection and OCR and support much larger context windows (up to 2M tokens) than Claude's default vision.

What audio and video file formats are supported?

The skill supports MP4 and MOV video and WAV, MP3, and AAC audio. It can process videos up to 6 hours long and audio up to 9.5 hours long.
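
The format and duration limits above can be expressed as a small pre-flight check. This is a sketch under stated assumptions: the function name check_media is hypothetical, the caller is assumed to already know the duration in seconds, and the format is inferred from the file extension only.

```python
from pathlib import Path

# Formats and limits as stated in this listing.
VIDEO_EXTENSIONS = {".mp4", ".mov"}
AUDIO_EXTENSIONS = {".wav", ".mp3", ".aac"}
MAX_VIDEO_SECONDS = 6 * 3600          # 6 hours
MAX_AUDIO_SECONDS = int(9.5 * 3600)   # 9.5 hours


def check_media(filename, duration_seconds):
    """Return (True, kind) if the file looks acceptable, else (False, reason).

    Hypothetical helper: only the extension and the supplied duration
    are checked; the file itself is never opened.
    """
    ext = Path(filename).suffix.lower()
    if ext in VIDEO_EXTENSIONS:
        kind, limit = "video", MAX_VIDEO_SECONDS
    elif ext in AUDIO_EXTENSIONS:
        kind, limit = "audio", MAX_AUDIO_SECONDS
    else:
        return (False, f"unsupported format: {ext or 'no extension'}")
    if duration_seconds > limit:
        return (False, f"{kind} exceeds the {limit / 3600:g}-hour limit")
    return (True, kind)
```

For example, check_media("meeting.mp4", 2 * 3600) accepts a two-hour recording, while a seven-hour .mov file is rejected before any upload is attempted.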

Can I generate videos using this Claude Code skill?

Yes. It uses the Google Veo 3 model to generate 8-second video clips with audio from your text descriptions.

AI Multimodal Studio

Name: AI Multimodal Studio
Author: bmad-labs

Category: Data Science & ML

Empowers Claude with Google Gemini's advanced vision, audio, and video processing capabilities for comprehensive multimedia analysis and generation.

This skill integrates the Google Gemini API directly into your Claude Code environment, providing superior image analysis, long-form video understanding, and high-fidelity audio transcription that exceeds standard vision limits. It enables Claude to extract structured data from complex documents, generate high-quality images via Imagen 4, and create short video clips using Veo 3. Designed for developers handling diverse media assets, it includes specialized scripts for batch processing and media optimization, supporting massive context windows up to 2 million tokens for deep temporal and visual reasoning.

Key Features

01 Long-form video and audio processing with timestamped transcription, for video up to 6 hours and audio up to 9.5 hours.

02 Text-to-video generation producing 8-second clips with native audio via Veo 3.

03 Integrated media optimization tools to compress and format files for seamless API ingestion.

04 Advanced vision analysis including OCR, object detection, and complex PDF document extraction.

05 High-quality text-to-image generation and editing using Google's Imagen 4 models.

Use Cases

01 Extracting structured data and tables from complex multi-page technical diagrams and financial reports.

02 Summarizing and analyzing long-form meeting recordings or podcasts with speaker identification.

03 Generating marketing assets like product images and social media video clips directly from text prompts.

Install with 🐟 Skill.Fish

npx skillfish add bmad-labs/skills ai-multimodal
