PyMuPDF
Enables data extraction, analysis, conversion, and manipulation of PDF, XPS, and eBook documents in Python.
Introduction
PyMuPDF is a high-performance Python library for working with PDF and other document formats. Built on top of MuPDF, it supports data extraction, analysis, conversion, and manipulation, which makes it suitable for a wide range of document processing tasks. Optional features include font subsetting with fontTools, additional fonts with pymupdf-fonts, and optical character recognition (OCR) via Tesseract.
Key Features
- Manipulate PDF documents, including merging, splitting, and modifying pages
- Extract text, images, and metadata from PDF documents
- Create font subsets for text output (with fontTools)
- Perform optical character recognition (OCR) on images and document pages (with Tesseract)
- Convert PDF documents to other formats
Use Cases
- Automated PDF data extraction for data analysis pipelines
- Document conversion and manipulation workflows
- Optical Character Recognition (OCR) for scanned documents