Does it require a specific API key?

You need a Google AI Studio or Vertex AI API key, which can be configured via environment variables or .env files.
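As a rough illustration of the "environment variable or .env file" setup, the sketch below looks for a key in the process environment first and falls back to a simple `.env` file. The variable name `GEMINI_API_KEY` and the helper names are assumptions made for this example; the skill itself may use a different name or lookup order.

```python
import os
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(env_path=".env", name="GEMINI_API_KEY"):
    """Return the key from the environment, else from the .env file, else None.

    The default variable name is an assumption for illustration only.
    """
    key = os.environ.get(name)
    if key:
        return key
    if Path(env_path).is_file():
        return parse_env_file(env_path).get(name)
    return None
```

Checking the environment first means a key exported in the shell overrides whatever is stored in the file, which is a common convention but not something this page confirms for the skill.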

Can I extract tables from PDFs into structured data?

Absolutely. The skill includes native PDF vision processing to extract tables and forms directly into structured formats like JSON or Markdown.
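When a table is requested as JSON, the model's reply is plain text that sometimes arrives wrapped in a Markdown code fence. The helper below is a minimal sketch of turning such a reply into a list of row dictionaries; `parse_table_json` is a hypothetical name, and the assumption that the reply is a JSON array of objects depends on how the extraction prompt is written.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker


def parse_table_json(reply_text):
    """Convert a model reply containing a JSON array into a list of rows.

    Assumes the reply is either bare JSON or JSON wrapped in one code fence.
    """
    text = reply_text.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    rows = json.loads(text)
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array of table rows")
    return rows
```

For example, a reply consisting of a fenced block containing `[{"item": "Widget", "qty": 2}]` would come back as a one-element list holding that dictionary.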

Does it support image generation?

Yes, it uses specific Gemini models to generate, edit, and refine images from text prompts with controllable aspect ratios.

Which Gemini models are supported by this skill?

It supports the Gemini 2.5 and 2.0 series, including Pro, Flash, and Flash-Lite variants for different performance and cost needs.
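One way to picture the performance and cost trade-off is a small lookup from a priority to a model variant. The mapping below is purely illustrative: the model IDs follow Google's public naming for the 2.5 series, but the priority labels, the `pick_model` helper, and the choice of default are assumptions rather than the skill's actual behavior.

```python
# Illustrative mapping only; the skill may choose models differently.
MODEL_BY_PRIORITY = {
    "quality": "gemini-2.5-pro",         # strongest reasoning, highest cost
    "balanced": "gemini-2.5-flash",      # general-purpose default
    "cost": "gemini-2.5-flash-lite",     # cheapest and fastest variant
}


def pick_model(priority="balanced"):
    """Return a model ID for the given priority, or raise on unknown input."""
    try:
        return MODEL_BY_PRIORITY[priority]
    except KeyError:
        options = ", ".join(sorted(MODEL_BY_PRIORITY))
        raise ValueError(f"unknown priority {priority!r}; use one of: {options}") from None
```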

Can this skill handle large video files?

Yes, it supports video processing for files up to 6 hours long and includes a media optimizer script to compress files for API limits.

AI Multimodal Processing

Name: AI Multimodal Processing
Author: xiaxingxiaowei1983

Data Science & ML

Integrates Google Gemini API to process, analyze, and generate audio, video, images, and complex documents within Claude.

This skill provides a unified interface for Claude to interact with diverse media formats using the Google Gemini 2.0 and 2.5 model series. It enables advanced capabilities like long-form video analysis for up to 6 hours, multi-page PDF data extraction, audio transcription with timestamps, and high-quality image generation. By automating media optimization and batch processing, it allows developers to build sophisticated multimodal AI features, perform complex visual Q&A, and convert unstructured media into structured data without leaving their development environment.

Key Features

01 Advanced video analysis including scene detection and temporal Q&A for long-form content.

02 High-fidelity audio transcription, speaker identification, and environmental sound analysis.

03 Native PDF vision processing for extracting tables, forms, and charts from documents up to 1,000 pages.

04 Intelligent image understanding featuring object detection, pixel-level segmentation, and OCR.

05 0 GitHub stars

06 Integrated text-to-image generation and iterative editing with support for multiple aspect ratios.

Use Cases

01 Creating AI-driven asset management pipelines that automatically tag images and describe visual content.

02 Automating the extraction of structured JSON data from complex multi-page financial reports and PDF forms.

03 Generating searchable transcripts and summaries for long-form video content and technical webinars.

Install with 🐟 Skill.Fish

npx skillfish add xiaxingxiaowei1983/ai ai-multimodal

For use in Claude.ai and ChatGPT
