Does this skill support image generation from text prompts?

Yes. With the Gemini 2.5 Flash Image model, you can generate, edit, and refine images in a range of aspect ratios.

Which Gemini models are compatible with this skill?

The skill supports the Gemini 2.5 and 2.0 series, including Pro, Flash, and Flash-Lite models, to balance quality and processing speed.
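As a rough illustration of the quality-versus-speed trade-off, the sketch below maps a priority to a Gemini model ID. The helper name pick_model and the three priority labels are hypothetical and are not part of the skill; only the model IDs are taken from the public Gemini model list.

```python
# Hypothetical helper: the priority labels are assumptions made for this
# sketch, not options exposed by the skill itself.
MODEL_BY_PRIORITY = {
    "quality": "gemini-2.5-pro",         # strongest reasoning, slowest
    "balanced": "gemini-2.5-flash",      # general-purpose default
    "speed": "gemini-2.5-flash-lite",    # lowest latency and cost
}


def pick_model(priority: str = "balanced") -> str:
    """Return a Gemini model ID for the given priority label."""
    try:
        return MODEL_BY_PRIORITY[priority]
    except KeyError:
        options = ", ".join(sorted(MODEL_BY_PRIORITY))
        raise ValueError(f"unknown priority {priority!r}; expected one of: {options}")
```

A batch captioning job might call pick_model("speed"), while a one-off analysis of a dense document might prefer pick_model("quality").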

How do I configure the API credentials for this skill?

The skill looks for a GEMINI_API_KEY in your environment variables or within .env files located in your project or .claude/skills directory.
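The lookup described above can be sketched as follows. This is a minimal illustration, not the skill's actual code: the function names are hypothetical, and the precedence (environment variable first, then the project .env, then .claude/skills/.env) is an assumption, since the listing does not state the order.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def read_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    if not path.is_file():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def find_gemini_api_key(
    project_dir: Path, environ: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Return GEMINI_API_KEY from the environment or a .env file, else None."""
    environ = os.environ if environ is None else environ
    key = environ.get("GEMINI_API_KEY")
    if key:
        return key
    root = Path(project_dir)
    # Assumed search order; the real skill may check more locations.
    candidates = [root / ".env", root / ".claude" / "skills" / ".env"]
    for candidate in candidates:
        key = read_env_file(candidate).get("GEMINI_API_KEY")
        if key:
            return key
    return None
```

Calling find_gemini_api_key(Path.cwd()) returns None when no key is configured, which lets a script fail early with a clear message instead of sending an unauthenticated request.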

Can this skill extract data from complex PDF documents?

Yes, it features native PDF vision processing for up to 1,000 pages, allowing for the extraction of tables, forms, and diagrams into structured formats like JSON.
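For orientation, here is a sketch of how a PDF might be sent to the public Gemini REST endpoint with a request for JSON output. It uses only the Python standard library and is not taken from the skill's own scripts; the function names, the default model, and the prompt are assumptions. Inline upload suits small files only; large documents typically go through the Files API instead.

```python
import base64
import json
import urllib.request

API_ROOT = "https://generativelanguage.googleapis.com/v1beta"


def build_pdf_request(pdf_bytes: bytes, prompt: str) -> dict:
    """Build a generateContent body with the PDF inlined as base64."""
    return {
        "contents": [{
            "parts": [
                {"inline_data": {
                    "mime_type": "application/pdf",
                    "data": base64.b64encode(pdf_bytes).decode("ascii"),
                }},
                {"text": prompt},
            ]
        }],
        # Ask the model to reply with JSON rather than free-form text.
        "generationConfig": {"responseMimeType": "application/json"},
    }


def extract_from_pdf(pdf_bytes: bytes, prompt: str, api_key: str,
                     model: str = "gemini-2.5-flash"):
    """Send the request and parse the model's JSON reply (network call)."""
    request = urllib.request.Request(
        f"{API_ROOT}/models/{model}:generateContent",
        data=json.dumps(build_pdf_request(pdf_bytes, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumes the first part of the first candidate holds the JSON text.
    return json.loads(body["candidates"][0]["content"]["parts"][0]["text"])
```

A prompt such as "Return every table as a list of row objects" keeps the output easy to validate, since the reply can be checked against an expected set of keys before it is used.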

What is the maximum video length supported for analysis?

You can process videos up to 6 hours in length for tasks like scene detection, summarization, and temporal Q&A.

AI Multimodal Processing

Name: AI Multimodal Processing
Author: piotroq


Data Science & ML

Processes and generates multimedia content including audio, video, images, and documents via the Google Gemini API.

This skill provides a unified interface for Claude to leverage Google Gemini's advanced multimodal capabilities directly within your workflow. It enables complex tasks such as long-form video analysis (up to 6 hours), multi-page PDF extraction, audio transcription with speaker identification, and high-quality image generation. By integrating Gemini 2.0 and 2.5 models, it allows developers to automate media-intensive tasks, extract structured data from visual sources, and generate creative assets using a set of optimized scripts and command-line patterns.

Key Features

01 Text-to-image generation and iterative editing with controllable aspect ratios

02 Advanced image understanding including OCR, object detection, and pixel-level segmentation

03 Native PDF processing for up to 1,000 pages with structured table and form extraction

04 High-fidelity audio transcription and analysis for files up to 9.5 hours

05 Long-form video processing with scene detection and temporal Q&A capabilities

Use Cases

01 Generating descriptive metadata, captions, and object labels for large-scale image and video libraries

02 Automating the extraction of structured JSON data from complex multi-page PDF forms and charts

03 Transcribing and summarizing long technical meetings or video tutorials with precise timestamps

Install with 🐟 Skill.Fish

npx skillfish add piotroq/adwokat-trzebnica-html-to-php-com ai-multimodal
