Does this skill require a separate API key?

Yes, you must provide a Google Gemini API key from Google AI Studio or Vertex AI via your environment variables.
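
As a minimal sketch of the environment-variable setup described above, the Python helper below looks up the key before any API call is made. The variable names `GEMINI_API_KEY` and `GOOGLE_API_KEY` and the helper `load_gemini_api_key` are assumptions for illustration; the skill's own documentation is the authority on which name it actually reads.

```python
import os
from typing import Mapping, Optional, Sequence

# Assumed variable names; check the skill's documentation for the real one.
CANDIDATE_VARS: Sequence[str] = ("GEMINI_API_KEY", "GOOGLE_API_KEY")


def load_gemini_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first non-empty API key found in the environment.

    Raises RuntimeError if none of the candidate variables is set.
    """
    env = os.environ if env is None else env
    for name in CANDIDATE_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError(
        "No Gemini API key found; set one of: " + ", ".join(CANDIDATE_VARS)
    )
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the API later on.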

How long can processed video files be?

The skill can handle up to 6 hours of low-resolution video or approximately 2 hours of standard-resolution video.
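
The two limits quoted above can be encoded as a simple pre-flight check. This is only a sketch: the function name `fits_video_limit` and the constant names are hypothetical, the 2-hour figure is approximate per the answer above, and the real limits may vary by model.

```python
# Limits as stated in the answer above, in seconds (hypothetical constant names).
MAX_LOW_RES_SECONDS = 6 * 60 * 60       # up to 6 hours at low resolution
MAX_STANDARD_RES_SECONDS = 2 * 60 * 60  # approximately 2 hours at standard resolution


def fits_video_limit(duration_seconds: float, low_resolution: bool = False) -> bool:
    """Return True if a video of this duration is within the stated limit."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    limit = MAX_LOW_RES_SECONDS if low_resolution else MAX_STANDARD_RES_SECONDS
    return duration_seconds <= limit
```

For example, a 3-hour recording (10,800 seconds) passes only when `low_resolution=True`.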

Can it process large PDF documents?

Yes, the skill supports native PDF vision processing for documents up to 1,000 pages long, including table and chart analysis.

What file formats are supported for audio and video?

It supports common formats, including MP3, WAV, and AAC for audio and MP4, MOV, AVI, and WebM for video, as well as YouTube URLs.
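
A caller might sort inputs into these categories before sending them to the skill. The sketch below assumes only the formats named in the answer above, which describes them as "common formats" and so may not be exhaustive. The helper `classify_media` and the list of YouTube hostnames are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Formats named in the answer above; the real list may be longer.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".aac"}
VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".webm"}
# Assumed set of hostnames that count as YouTube URLs.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_media(source: str) -> Optional[str]:
    """Return "audio", "video", "youtube", or None for an unrecognized source."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Only YouTube URLs are mentioned in the source; other URLs are rejected.
        host = (parsed.hostname or "").lower()
        return "youtube" if host in YOUTUBE_HOSTS else None
    suffix = PurePosixPath(source).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return None
```

Matching on the file extension is a cheap first filter; it does not verify that the file contents are actually in that format.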

What models are compatible with this skill?

It supports the Gemini 2.0 and 2.5 series, including Flash, Pro, and specialized image-generation models.

AI Multimodal Processing

Author: Activer007

Data Science & ML

Processes and generates multimedia content including audio, video, images, and documents using the Google Gemini API.

This skill gives Claude multimodal capabilities through Google's Gemini API (2.0 and 2.5 series). It supports analysis of audio files up to 9.5 hours long, video up to 6 hours long (at low resolution; roughly 2 hours at standard resolution), and PDF vision extraction for documents up to 1,000 pages. It also provides text-to-image generation and refinement, so developers can automate media transcription, extract structured data from visual documents, and integrate AI-driven image creation into their coding workflows.

Key Features

01 High-fidelity PDF extraction for tables, forms, and charts

02 0 GitHub stars

03 Comprehensive audio transcription with timestamps and speaker ID

04 Advanced vision tasks including object detection and segmentation

05 Long-form video analysis and scene detection up to 6 hours

06 Text-to-image generation and iterative image refinement

Use Cases

01 Generating summaries and searchable transcripts for long-form meetings or video content

02 Automating the extraction of structured data from complex multi-page PDF documents

03 Creating and editing visual assets using natural language prompts within the dev environment

Install with 🐟 Skill.Fish

npx skillfish add activer007/ordinary-claude-skills ai-multimodal

For use in Claude.ai and ChatGPT
