PyMuPDF

Created by pymupdf

Enables data extraction, analysis, conversion, and manipulation of PDF, XPS, and eBook documents in Python.

About

PyMuPDF is a high-performance Python library that provides comprehensive capabilities for working with PDF and other document formats. Built on top of MuPDF, it offers functionalities for data extraction, analysis, conversion, and manipulation, making it a versatile tool for various document processing tasks. It supports optional features such as font subsetting with fontTools, enhanced fonts with pymupdf-fonts, and optical character recognition (OCR) via Tesseract.

Key Features

  • Manipulate PDF documents, including merging, splitting, and modifying pages
  • Extract text, images, and metadata from PDF documents
  • 7,177 GitHub stars
  • Create font subsets for text output (with fontTools)
  • Perform optical character recognition (OCR) on images and document pages (with Tesseract)
  • Convert PDF documents to other formats

Use Cases

  • Automated PDF data extraction for data analysis pipelines
  • Document conversion and manipulation workflows
  • Optical Character Recognition (OCR) for scanned documents
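A document manipulation workflow such as merging could look roughly like the sketch below. `merge_pdfs` is a hypothetical helper name, and the sketch assumes PyMuPDF is installed and that every input is a readable PDF with at least one page.

```python
def merge_pdfs(input_paths, output_path):
    """Append all pages of each input PDF, in order, and save the result.

    Returns the number of pages in the merged document.
    """
    try:
        import pymupdf  # module name in recent releases
    except ImportError:
        import fitz as pymupdf  # module name in older releases

    merged = pymupdf.open()  # calling open() with no arguments creates an empty PDF
    try:
        for path in input_paths:
            with pymupdf.open(path) as src:
                merged.insert_pdf(src)  # copies every page of src to the end
        merged.save(output_path)
        return merged.page_count
    finally:
        merged.close()
```

Splitting can follow the same pattern in reverse: `insert_pdf` also accepts `from_page` and `to_page` arguments, so a page range can be copied into a new document.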