PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Top Related Projects

qpdf

3,934

qpdf: A content-preserving PDF document transformer

pikepdf

2,329

A Python library for reading and writing PDF, powered by QPDF

pdfminer.six

6,549

Community maintained fork of pdfminer - we fathom PDF

pypdf

9,187

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

pdfarranger

4,325

Small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.

Quick Overview

PyMuPDF is a Python binding for MuPDF, a lightweight PDF and XPS viewer. It provides fast access to PDF documents and allows for various operations such as rendering, text extraction, searching, and manipulation of PDF files. PyMuPDF is known for its speed and versatility in handling PDF and other document formats.

Pros

High performance and speed in PDF processing
Supports multiple document formats (PDF, XPS, EPUB, FB2, CBZ, SVG)
Extensive features for PDF manipulation and analysis
Active development and regular updates

Cons

Complex API for some advanced operations
Limited support for creating PDFs from scratch
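Built on the MuPDF C library; the pip package has no mandatory external dependencies, but platforms without a prebuilt wheel may need a source build
Dual-licensed under the AGPL and a commercial license; projects that cannot meet the AGPL terms need the commercial license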

Code Examples

Opening a PDF and extracting text:

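import pymupdf  # current module name; older releases import as `fitz`

doc = pymupdf.open("example.pdf")  # open the document
page = doc[0]                      # first page (0-based index)
text = page.get_text()             # plain text of the page
print(text)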

Rendering a page as an image:

import pymupdf

doc = pymupdf.open("example.pdf")
page = doc[0]
pix = page.get_pixmap()      # render the page to a raster image
pix.save("page_image.png")   # image format is taken from the file extension

Searching for text in a PDF:

import pymupdf

doc = pymupdf.open("example.pdf")
search_term = "example"
for page in doc:
    # search_for returns a list of rectangles surrounding the hits
    text_instances = page.search_for(search_term)
    print(f"Found {len(text_instances)} instances on page {page.number + 1}")

Adding annotations to a PDF:

import pymupdf

doc = pymupdf.open("example.pdf")
page = doc[0]
# highlight the rectangle (x0, y0, x1, y1), given in points
annot = page.add_highlight_annot((100, 100, 200, 120))
annot.set_info(title="Highlight", content="Important section")
doc.save("annotated.pdf")

Getting Started

To get started with PyMuPDF, follow these steps:

Install PyMuPDF using pip:
```
pip install PyMuPDF
```
Import the library in your Python script (older releases use `import fitz`):
```
import pymupdf
```

Open a PDF file and start working with it:

doc = pymupdf.open("example.pdf")
page = doc[0]
text = page.get_text()
print(text)

For more detailed information and advanced usage, refer to the official documentation at https://pymupdf.readthedocs.io/.

Competitor Comparisons

qpdf

3,934

qpdf: A content-preserving PDF document transformer

Pros of qpdf

Written in C++, offering potentially better performance for low-level PDF operations
Provides a command-line interface for quick PDF manipulations
Focuses on PDF transformations and repairs, making it more specialized for certain tasks

Cons of qpdf

Less comprehensive PDF manipulation capabilities compared to PyMuPDF
Requires more setup and integration effort for use in Python projects
Limited high-level abstractions for complex PDF operations

Code Comparison

PyMuPDF:

import pymupdf
doc = pymupdf.open("input.pdf")
page = doc[0]
text = page.get_text()
doc.close()

qpdf (command-line tool):

# qpdf is a C++ library and CLI; from Python, use pikepdf or call the CLI.
# Shell equivalent of a basic rewrite:
#   qpdf input.pdf output.pdf

Note: These examples show basic operations only and do not represent the full capabilities of either tool. PyMuPDF offers high-level features such as text extraction and rendering, while qpdf focuses on structural PDF transformations and does not extract text itself, so text extraction requires a separate tool.
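
Since qpdf is distributed as a C++ library and a command-line tool, one way to use it from a Python script is to call the executable. The sketch below assumes `qpdf` is installed and on `PATH`; the helper names `qpdf_command` and `run_qpdf` are hypothetical, and `--linearize` is the qpdf option that produces a web-optimized file.

```python
import shutil
import subprocess


def qpdf_command(src, dst, linearize=False):
    """Build the argument list for a basic qpdf rewrite of src into dst."""
    cmd = ["qpdf"]
    if linearize:
        cmd.append("--linearize")  # write a linearized (web-optimized) PDF
    cmd += [src, dst]
    return cmd


def run_qpdf(src, dst, linearize=False):
    """Run qpdf and raise if it is missing or exits with an error status."""
    if shutil.which("qpdf") is None:
        raise FileNotFoundError("qpdf executable not found on PATH")
    subprocess.run(qpdf_command(src, dst, linearize), check=True)
```

Calling `run_qpdf("input.pdf", "output.pdf", linearize=True)` would then rewrite the file; building the command in a separate function keeps it easy to inspect without running qpdf.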

pikepdf

2,329

A Python library for reading and writing PDF, powered by QPDF

Pros of pikepdf

Focused on PDF manipulation and editing, offering more specialized PDF-specific features
Better support for PDF/A compliance and validation
Faster performance for certain PDF operations, especially with large files

Cons of pikepdf

Less comprehensive document format support (primarily focused on PDFs)
Steeper learning curve for beginners due to its more specialized nature
Smaller community and fewer resources compared to PyMuPDF

Code Comparison

PyMuPDF:

import pymupdf
doc = pymupdf.open("input.pdf")
page = doc[0]
text = page.get_text()
doc.save("output.pdf")

pikepdf:

import pikepdf
pdf = pikepdf.Pdf.open("input.pdf")
page = pdf.pages[0]
# pikepdf has no text-extraction method; it works on pages and PDF objects
pdf.save("output.pdf")

Both libraries can open, modify, and save PDFs. pikepdf concentrates on the structure of the file (objects, pages, metadata) and does not extract text, while PyMuPDF adds text extraction, rendering, and support for other document formats. PyMuPDF's API is generally more approachable for beginners, while pikepdf's API is geared towards lower-level PDF manipulation tasks.

pdfminer.six

6,549

Community maintained fork of pdfminer - we fathom PDF

Pros of pdfminer.six

Pure Python implementation, making it easier to install and use across different platforms
More focused on text extraction and analysis, providing detailed information about text positioning and formatting
Offers more granular control over the PDF parsing process

Cons of pdfminer.six

Generally slower performance compared to PyMuPDF, especially for large PDF files
Less comprehensive feature set, primarily focused on text extraction rather than full PDF manipulation
May require more code to achieve certain tasks, as it provides lower-level access to PDF elements

Code Comparison

PyMuPDF:

import pymupdf
doc = pymupdf.open("example.pdf")
text = doc[0].get_text()

pdfminer.six:

from pdfminer.high_level import extract_text
text = extract_text("example.pdf")

Both libraries extract text in a few lines. Note that the PyMuPDF snippet reads only the first page, while `extract_text` in pdfminer.six processes the whole file. pdfminer.six also provides more detailed layout-analysis options when needed, allowing finer control over the process at the cost of additional code complexity.

pypdf

9,187

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Pros of pypdf

Pure Python implementation, making it easier to install and use across different platforms
Simpler API, suitable for basic PDF operations and manipulation
Actively maintained with regular updates and community support

Cons of pypdf

Limited functionality compared to PyMuPDF, especially for complex PDF operations
Slower performance for large PDF files or intensive operations
Less comprehensive documentation and fewer examples available

Code Comparison

PyMuPDF:

import pymupdf
doc = pymupdf.open("input.pdf")
page = doc[0]
text = page.get_text()
doc.close()

pypdf:

from pypdf import PdfReader
reader = PdfReader("input.pdf")
page = reader.pages[0]
text = page.extract_text()

Both libraries offer similar basic functionality for reading PDF files and extracting text. However, PyMuPDF provides more advanced features and better performance for complex operations, while pypdf offers a simpler API and easier installation process. The choice between the two depends on the specific requirements of your project and the complexity of PDF operations needed.

pdfarranger

4,325

Small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.

Pros of pdfarranger

User-friendly GUI for PDF manipulation tasks
Focused on simple operations like reordering, rotating, and merging pages
Lightweight and easy to install for end-users

Cons of pdfarranger

Limited functionality compared to PyMuPDF's extensive PDF manipulation capabilities
Less suitable for programmatic or automated PDF processing tasks
Lacks advanced features like text extraction, annotation handling, or PDF creation from scratch

Code Comparison

PyMuPDF:

import pymupdf
doc = pymupdf.open("input.pdf")
page = doc[0]
text = page.get_text()
doc.save("output.pdf")

pdfarranger:

# pdfarranger is primarily a GUI application
# It doesn't have a direct Python API for scripting
# Users interact with the application through its graphical interface

PyMuPDF offers a comprehensive Python API for PDF manipulation, making it suitable for both scripting and integration into larger applications. pdfarranger, on the other hand, is designed as a standalone GUI application for simple PDF editing tasks, making it more accessible for non-programmers but less flexible for complex operations or automation.

README

Community

Join us on Discord here: #pymupdf

Installation

PyMuPDF requires Python 3.9 or later. Install it with pip:

pip install PyMuPDF

There are no mandatory external dependencies. However, some optional features become available only if additional packages are installed.

You can also try without installing by visiting PyMuPDF.io.

Usage

Basic usage is as follows:

import pymupdf # imports the pymupdf library
doc = pymupdf.open("example.pdf") # open a document
for page in doc: # iterate the document pages
  text = page.get_text() # get plain text encoded as UTF-8

Documentation

Full documentation can be found on pymupdf.readthedocs.io.

Optional Features

fontTools for creating font subsets.
pymupdf-fonts contains some nice fonts for your text output.
Tesseract-OCR for optical character recognition in images and document pages.

About

PyMuPDF adds Python bindings and abstractions to MuPDF, a lightweight PDF, XPS, and eBook viewer, renderer, and toolkit. Both PyMuPDF and MuPDF are maintained and developed by Artifex Software, Inc.

PyMuPDF was originally written by Jorj X. McKie.

License and Copyright

PyMuPDF is available under open-source AGPL and commercial license agreements. If you determine you cannot meet the requirements of the AGPL, please contact Artifex for more information regarding a commercial license.
