Can it extract data from tables within a PDF?

Yes, it features native PDF vision processing specifically designed to extract tables, forms, and charts into structured formats like JSON or Markdown.
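As a rough illustration of the two output formats mentioned above, the sketch below renders table rows that have already been extracted as JSON into a Markdown table. The JSON shape (a list of flat row objects) and the helper name rows_to_markdown are assumptions made for this example and are not taken from the skill itself.

```python
import json
from typing import Dict, List


def rows_to_markdown(rows: List[Dict[str, object]]) -> str:
    """Render a list of flat row objects as a Markdown table."""
    if not rows:
        return ""
    # Column order follows the keys of the first row (assumed shape).
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        # Missing cells become empty strings instead of raising KeyError.
        cells = [str(row.get(h, "")) for h in headers]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Hypothetical JSON, as it might come back for a small invoice table.
extracted = json.loads(
    '[{"item": "Widget", "qty": 2, "price": "9.99"},'
    ' {"item": "Gadget", "qty": 1, "price": "24.50"}]'
)
print(rows_to_markdown(extracted))
```

Keeping JSON as the primary format and deriving Markdown from it means the structured data stays machine-readable while the table can still be pasted into a report.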

Does this skill require a separate API key?

Yes, you need a Google Gemini API key from Google AI Studio or credentials for a Vertex AI project on Google Cloud Platform.
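A minimal sketch of how the two credential options could be resolved from environment variables before any request is made. The variable names used here (GEMINI_API_KEY, GOOGLE_API_KEY, GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_LOCATION), the default region, and the function name are assumptions for illustration; the skill's own configuration may differ.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_credentials(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Pick AI Studio or Vertex AI credentials from environment variables."""
    env = os.environ if env is None else env

    # Option 1: an API key from Google AI Studio (variable names are assumed).
    api_key = env.get("GEMINI_API_KEY") or env.get("GOOGLE_API_KEY")
    if api_key:
        return {"mode": "ai_studio", "api_key": api_key}

    # Option 2: a Vertex AI project on Google Cloud Platform.
    project = env.get("GOOGLE_CLOUD_PROJECT")
    if project:
        return {
            "mode": "vertex_ai",
            "project": project,
            # Assumed fallback region, used only when none is configured.
            "location": env.get("GOOGLE_CLOUD_LOCATION", "us-central1"),
        }

    raise RuntimeError(
        "No Gemini credentials found: set an API key or a Vertex AI project."
    )
```

Passing the environment as an argument keeps the function easy to test, for example resolve_credentials({"GEMINI_API_KEY": "my-key"}).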

Which file formats are supported by this skill?

The skill supports a broad range of formats: audio (MP3, WAV, AAC), images (PNG, JPEG, WEBP), video (MP4, MOV, AVI), and PDF documents, which are processed natively.
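The format list can be expressed as a small lookup that decides how a file should be treated before it is sent for processing. This is only a sketch: the table name SUPPORTED_FORMATS and the function classify_media are hypothetical, and the .jpg extension is added on the assumption that JPEG files commonly use it.

```python
from pathlib import Path
from typing import Optional

# Extensions grouped by modality, following the formats listed above.
SUPPORTED_FORMATS = {
    "audio": {".mp3", ".wav", ".aac"},
    "image": {".png", ".jpeg", ".jpg", ".webp"},  # .jpg is an assumed alias
    "video": {".mp4", ".mov", ".avi"},
    "document": {".pdf"},
}


def classify_media(path: str) -> Optional[str]:
    """Return the modality for a file path, or None if it is unsupported."""
    ext = Path(path).suffix.lower()
    for modality, extensions in SUPPORTED_FORMATS.items():
        if ext in extensions:
            return modality
    return None


for name in ["meeting.MP4", "report.pdf", "notes.txt"]:
    print(name, "->", classify_media(name))
```

Checking the extension up front gives a clear error for unsupported files instead of a failed upload later.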

Can I use it to generate images from text?

Absolutely. The skill includes image generation capabilities using text-to-image models, supporting various aspect ratios and iterative refinements.

What is the maximum video length I can process?

The limit depends on the model's context window. With a Gemini model that offers a 2M token context window, you can process up to 6 hours of low-resolution video or approximately 2 hours at default resolution; a 1M token window allows roughly half of that.
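The figures follow from simple token arithmetic. The sketch below assumes video costs roughly 300 tokens per second at default resolution and roughly 100 tokens per second at low resolution; these rates are approximations assumed for illustration, and the calculation ignores the tokens taken up by the prompt and the response.

```python
# Approximate token cost of one second of video (assumed rates).
DEFAULT_TOKENS_PER_SECOND = 300
LOW_RES_TOKENS_PER_SECOND = 100


def max_video_hours(context_tokens: int, tokens_per_second: int) -> float:
    """Upper bound on video length, in hours, that fits in a context window."""
    seconds = context_tokens / tokens_per_second
    return seconds / 3600


context = 2_000_000  # 2M token context window

# Default resolution: about 1.9 hours, in line with "approximately 2 hours".
print(f"{max_video_hours(context, DEFAULT_TOKENS_PER_SECOND):.1f} h")

# Low resolution: about 5.6 hours, in line with "up to 6 hours".
print(f"{max_video_hours(context, LOW_RES_TOKENS_PER_SECOND):.1f} h")
```

The same function shows why a smaller context window halves the limits: max_video_hours(1_000_000, 300) comes out just under one hour.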

AI Multimodal Processing

Author: Microck

Data Science & ML

Processes and generates audio, video, images, and complex documents using Google Gemini's advanced multimodal API capabilities.

This skill provides a unified interface for Claude to interact with Google Gemini's multimodal models, enabling deep analysis of multimedia content within a coding workflow. It allows users to perform high-fidelity audio transcription, analyze videos up to 6 hours long, extract structured data from complex multi-page PDFs, and generate or edit high-quality images. By bridging the gap between raw media files and actionable text-based insights, it is an essential tool for developers building media-intensive applications, automating document workflows, or requiring sophisticated visual and auditory understanding.

Key Features

01 81 GitHub stars

02 High-resolution image generation and pixel-level segmentation for visual editing.

03 Native PDF vision processing for accurate table, form, and chart extraction.

04 Advanced video analysis with scene detection, temporal Q&A, and YouTube URL support.

05 Support for Gemini 2.5/2.0 models with context windows of up to 2M tokens.

06 Comprehensive audio processing including transcription, speaker ID, and text-to-speech.

Use Cases

01 Extracting structured JSON data from batches of complex financial or legal PDF documents.

02 Prototyping and refining visual assets directly through text-to-image generation and iterative editing.

03 Generating automated summaries and timestamped transcripts for long-form video meetings or technical webinars.


Install with 🐟 Skill.Fish

npx skillfish add microck/ordinary-claude-skills ai-multimodal
