Can this skill handle scanned images or non-selectable text?

Yes, it supports both local OCR using Tesseract and the Mistral OCR API for high-accuracy extraction from scanned documents and images.
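One way to decide when OCR is needed at all is to check how much selectable text a page yields. The sketch below is an assumption for illustration, not the skill's actual logic: the function names and the `min_chars` threshold are hypothetical, and the input is whatever per-page text a local extractor returned.

```python
def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Return True when a page yields too little selectable text to trust.

    min_chars is an illustrative threshold, not a value taken from the skill.
    """
    # Count only non-whitespace characters so pages that are all spacing are flagged.
    visible = sum(1 for ch in page_text if not ch.isspace())
    return visible < min_chars


def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return zero-based indexes of pages that should be routed to OCR."""
    return [i for i, text in enumerate(page_texts) if needs_ocr(text, min_chars)]
```

Pages flagged this way could then be sent to Tesseract or the Mistral OCR API, while the rest keep their locally extracted text.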

Why is Markdown the preferred output format for this skill?

Markdown is preferred for LLM consumption because it preserves semantic markers like headings and lists, which serve as context boundaries for better RAG chunking and document understanding.
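To illustrate how headings can act as chunk boundaries, here is a minimal sketch that splits Markdown at ATX headings (`#` through `######`). The function name and the returned dictionary shape are hypothetical, and the sketch does not skip `#` lines that appear inside fenced code blocks.

```python
import re

# An ATX heading: one to six '#' characters, whitespace, then the title text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, starting a new chunk at every heading."""
    chunks = []
    current = {"heading": None, "lines": []}

    def has_content(chunk: dict) -> bool:
        # Keep a chunk if it has a heading or at least one non-blank line.
        return chunk["heading"] is not None or any(l.strip() for l in chunk["lines"])

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            if has_content(current):
                chunks.append(current)
            current = {"heading": match.group(2), "lines": []}
        else:
            current["lines"].append(line)
    if has_content(current):
        chunks.append(current)

    return [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]
```

Each chunk keeps its heading alongside the body text, so a retriever can index the two together.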

Is an API key required for PDF extraction?

No, an API key is not required for local extraction using PyMuPDF or pdfplumber. An API key is only necessary if you opt to use the Mistral OCR service for higher accuracy on complex layouts.

What is the fastest way to extract text from a standard PDF?

For simple, text-based PDFs, PyMuPDF is the fastest option and is recommended as the first step in the provided workflow.
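A minimal sketch of that first step, assuming PyMuPDF is installed (it is imported as `fitz`). The helper names `join_pages` and `extract_pdf_text` are hypothetical and not part of the skill; only `fitz.open` and `Page.get_text` are PyMuPDF calls.

```python
def join_pages(pages) -> str:
    """Concatenate plain text from objects that expose a get_text() method."""
    parts = []
    for page in pages:
        text = page.get_text().strip()
        if text:  # skip pages with no extractable text
            parts.append(text)
    # Separate pages with a blank line so paragraph boundaries survive.
    return "\n\n".join(parts)


def extract_pdf_text(path: str) -> str:
    """Extract plain text from a text-based PDF using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so join_pages works without it

    # Iterating a PyMuPDF document yields its pages in order.
    with fitz.open(path) as doc:
        return join_pages(doc)
```

If the result comes back empty or nearly empty, the file is probably scanned and would need one of the OCR routes instead.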

Which tool should I use for PDFs with lots of tables?

The skill recommends using the pdfplumber engine, as it is specifically optimized to recognize and preserve table structures in machine-generated PDFs, converting them into readable Markdown tables.
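The conversion to a Markdown table can be sketched as below. It assumes the input is a list of rows such as pdfplumber's `page.extract_tables()` returns for each table, where empty cells may be `None`. The function `rows_to_markdown` is a hypothetical helper, and treating the first row as the header is an assumption that will not hold for every table.

```python
def rows_to_markdown(rows: list[list]) -> str:
    """Render a list of table rows as a Markdown table (first row = header)."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell) -> str:
        text = "" if cell is None else str(cell)
        # Collapse line breaks inside a cell and escape pipes, which would
        # otherwise be read as column separators.
        return " ".join(text.split()).replace("|", "\\|")

    def render(row) -> str:
        # Pad short rows so every line has the same number of columns.
        cells = [clean(c) for c in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [render(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(render(row) for row in body)
    return "\n".join(lines)
```

With pdfplumber installed, the rows would typically come from something like `pdfplumber.open(path).pages[0].extract_tables()`, one list of rows per detected table.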

PDF Text Extraction for LLMs

Name: PDF Text Extraction for LLMs
Author: miwtoo


Data Science and ML

Extracts and converts PDF documents into LLM-friendly formats like Markdown to support RAG pipelines and document analysis.

This skill provides a comprehensive toolkit for transforming complex PDF documents into structured text optimized for language model consumption. It offers a decision-guided workflow to choose between high-speed local extraction using PyMuPDF, table-focused processing with pdfplumber, and high-accuracy OCR for scanned documents via the Mistral API. By prioritizing Markdown output, it ensures that document semantics, such as headings and tables, are preserved for better performance in RAG systems, data analysis tasks, and automated content processing.
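The routing described above could be sketched roughly as follows. This is an illustrative reading of the description, not the skill's actual decision guide: the function name, its parameters, and the returned labels are all hypothetical, and it only routes scanned files to OCR even though the Mistral API is also suggested for complex layouts.

```python
from typing import Optional


def choose_engine(
    has_text_layer: bool,
    has_tables: bool,
    mistral_api_key: Optional[str] = None,
) -> str:
    """Pick an extraction engine label from coarse properties of a PDF."""
    if not has_text_layer:
        # Scanned or image-only PDF: OCR is required. Use the hosted service
        # only when a key is available, otherwise fall back to local Tesseract.
        return "mistral-ocr" if mistral_api_key else "tesseract"
    if has_tables:
        # Machine-generated PDF with tables: table-focused extraction.
        return "pdfplumber"
    # Plain text-based PDF: the fastest local option.
    return "pymupdf"
```

A caller might pass the key from an environment variable of its own choosing, so that the hosted OCR path is used only when the user has opted in.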

Key Features

01 LLM-optimized Markdown output for preserving document structure

02 Native integration with Mistral OCR API for complex and scanned layouts

03 Multi-engine support including PyMuPDF, pdfplumber, and Tesseract

04 Specialized handling for tables, math formulas, and multilingual text

05 Automated decision guide to select the best extraction tool based on PDF type


Use Cases

01 Converting scanned legacy documents into searchable, AI-ready text formats

02 Preparing large document sets for RAG (Retrieval-Augmented Generation) pipelines

03 Automating data extraction from financial statements and tabular reports


Install with 🐟 Skill.Fish

npx skillfish add miwtoo/credit-card-extraction extracting-pdf-text
