PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Top Related Projects

  • qpdf: A content-preserving PDF document transformer
  • pikepdf: A Python library for reading and writing PDF, powered by QPDF
  • pdfminer.six: Community maintained fork of pdfminer - we fathom PDF
  • pypdf: A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
  • pdfarranger: Small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.

Quick Overview

PyMuPDF is a Python binding for MuPDF, a lightweight PDF and XPS viewer. It provides fast access to PDF documents and allows for various operations such as rendering, text extraction, searching, and manipulation of PDF files. PyMuPDF is known for its speed and versatility in handling PDF and other document formats.

Pros

  • High performance and speed in PDF processing
  • Supports multiple document formats (PDF, XPS, EPUB, FB2, CBZ, SVG)
  • Extensive features for PDF manipulation and analysis
  • Active development and regular updates

Cons

  • Complex API for some advanced operations
  • Limited support for creating PDFs from scratch
  • Built on the MuPDF library; a pip install has no mandatory external dependencies, but building from source may take more effort
  • Licensed under the AGPL; projects that cannot meet its requirements need a commercial license from Artifex

Code Examples

  1. Opening a PDF and extracting text:
import fitz

doc = fitz.open("example.pdf")
page = doc[0]
text = page.get_text()
print(text)
  2. Rendering a page as an image:
import fitz

doc = fitz.open("example.pdf")
page = doc[0]
pix = page.get_pixmap()
pix.save("page_image.png")
  3. Searching for text in a PDF:
import fitz

doc = fitz.open("example.pdf")
search_term = "example"
for page in doc:
    text_instances = page.search_for(search_term)
    print(f"Found {len(text_instances)} instances on page {page.number + 1}")
  4. Adding annotations to a PDF:
import fitz

doc = fitz.open("example.pdf")
page = doc[0]
annot = page.add_highlight_annot((100, 100, 200, 120))
annot.set_info(title="Highlight", content="Important section")
doc.save("annotated.pdf")
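Examples 3 and 4 combine naturally: `page.search_for` returns one rectangle per hit, and each rectangle can be handed to `page.add_highlight_annot`. The following is a minimal sketch rather than an official recipe. The function names `highlight_term` and `highlight_file` are illustrative assumptions, and the sketch imports the library as `pymupdf`, the name used in the README (older releases expose the same API as `fitz`).

```python
def highlight_term(doc, term):
    """Highlight every occurrence of `term` and return the number of hits.

    `doc` is expected to behave like a PyMuPDF document: iterating it yields
    pages that provide search_for() and add_highlight_annot().
    """
    total = 0
    for page in doc:
        for rect in page.search_for(term):  # one rectangle per hit
            page.add_highlight_annot(rect)  # mark that hit on the page
            total += 1
    return total


def highlight_file(src, dst, term):
    """Open `src`, highlight `term`, and save the result as `dst`."""
    import pymupdf  # imported lazily so highlight_term works without it

    with pymupdf.open(src) as doc:
        hits = highlight_term(doc, term)
        doc.save(dst)
    return hits
```

Calling `highlight_file("example.pdf", "highlighted.pdf", "example")` would write a highlighted copy and return the hit count; both file names are placeholders.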

Getting Started

To get started with PyMuPDF, follow these steps:

  1. Install PyMuPDF using pip:

    pip install PyMuPDF
    
  2. Import the library in your Python script (`fitz` is the legacy module name; recent releases also accept `import pymupdf`):

    import fitz
    
  3. Open a PDF file and start working with it:

    doc = fitz.open("example.pdf")
    page = doc[0]
    text = page.get_text()
    print(text)
    

For more detailed information and advanced usage, refer to the official documentation at https://pymupdf.readthedocs.io/.

Competitor Comparisons

qpdf: A content-preserving PDF document transformer

Pros of qpdf

  • Written in C++, offering potentially better performance for low-level PDF operations
  • Provides a command-line interface for quick PDF manipulations
  • Focuses on PDF transformations and repairs, making it more specialized for certain tasks

Cons of qpdf

  • Less comprehensive PDF manipulation capabilities compared to PyMuPDF
  • Requires more setup and integration effort for use in Python projects
  • Limited high-level abstractions for complex PDF operations

Code Comparison

PyMuPDF:

import fitz
doc = fitz.open("input.pdf")
page = doc[0]
text = page.get_text()
doc.close()

qpdf (via pikepdf, the Python library powered by QPDF):

import pikepdf
# Read the input and write it back out unchanged
with pikepdf.open("input.pdf") as pdf:
    pdf.save("output.pdf")

Note: The code examples demonstrate basic operations and may not fully represent the capabilities of each library. PyMuPDF offers more high-level PDF manipulation features, while qpdf focuses on lower-level PDF transformations and requires additional steps for text extraction and other operations.
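Because qpdf is primarily a command-line tool, another way to use it from Python is to call the executable. The sketch below assumes `qpdf` is installed and on `PATH`. The helper names `qpdf_command` and `run_qpdf` are hypothetical; only the basic `qpdf infile outfile` form and the `--linearize` option are used.

```python
import shutil
import subprocess


def qpdf_command(src, dst, linearize=False):
    """Build the argument list for a basic `qpdf infile outfile` call."""
    cmd = ["qpdf"]
    if linearize:
        cmd.append("--linearize")  # write a linearized ("fast web view") file
    cmd += [src, dst]
    return cmd


def run_qpdf(src, dst, linearize=False):
    """Run qpdf; raise if it is missing or exits with a non-zero status."""
    if shutil.which("qpdf") is None:
        raise FileNotFoundError("qpdf executable not found on PATH")
    subprocess.run(qpdf_command(src, dst, linearize), check=True)
```

Keeping the command builder separate from the subprocess call makes the argument list easy to inspect or log before anything is executed.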

A Python library for reading and writing PDF, powered by QPDF

Pros of pikepdf

  • Focused on PDF manipulation and editing, offering more specialized PDF-specific features
  • Better support for PDF/A compliance and validation
  • Faster performance for certain PDF operations, especially with large files

Cons of pikepdf

  • Less comprehensive document format support (primarily focused on PDFs)
  • Steeper learning curve for beginners due to its more specialized nature
  • Smaller community and fewer resources compared to PyMuPDF

Code Comparison

PyMuPDF:

import fitz
doc = fitz.open("input.pdf")
page = doc[0]
text = page.get_text()
doc.save("output.pdf")

pikepdf:

import pikepdf
pdf = pikepdf.Pdf.open("input.pdf")
page = pdf.pages[0]
page.rotate(90, relative=True)  # pikepdf edits page structure; it does not extract text
pdf.save("output.pdf")

Both libraries open, modify, and save PDFs. pikepdf works at the level of PDF objects and structure and does not extract text, while PyMuPDF adds text extraction, rendering, and broader document format support. PyMuPDF's API is generally more approachable for beginners, while pikepdf's API is geared towards low-level PDF manipulation tasks.

Community maintained fork of pdfminer - we fathom PDF

Pros of pdfminer.six

  • Pure Python implementation, making it easier to install and use across different platforms
  • More focused on text extraction and analysis, providing detailed information about text positioning and formatting
  • Offers more granular control over the PDF parsing process

Cons of pdfminer.six

  • Generally slower performance compared to PyMuPDF, especially for large PDF files
  • Less comprehensive feature set, primarily focused on text extraction rather than full PDF manipulation
  • May require more code to achieve certain tasks, as it provides lower-level access to PDF elements

Code Comparison

PyMuPDF:

import fitz
doc = fitz.open("example.pdf")
text = doc[0].get_text()

pdfminer.six:

from pdfminer.high_level import extract_text
text = extract_text("example.pdf")

Both libraries extract text in a few lines. The pdfminer.six helper shown here is the shorter of the two and returns the text of the whole document, whereas the PyMuPDF snippet reads only the first page. pdfminer.six also provides more detailed extraction options when needed, allowing finer control over the process at the cost of additional code complexity.
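For a like-for-like comparison with `extract_text`, the PyMuPDF side can collect the text of every page. This is a small sketch under stated assumptions: the helper names `document_text` and `extract_text_pymupdf` are illustrative, and the form-feed page separator is an arbitrary choice, not something the library prescribes.

```python
def document_text(doc, separator="\f"):
    """Join the plain text of every page in a PyMuPDF-like document."""
    return separator.join(page.get_text() for page in doc)


def extract_text_pymupdf(path):
    """Return the text of all pages of the file at `path`."""
    import pymupdf  # imported lazily so document_text works without it

    with pymupdf.open(path) as doc:
        return document_text(doc)
```

With this helper, `extract_text_pymupdf("example.pdf")` covers the whole document instead of only the first page.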

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Pros of pypdf

  • Pure Python implementation, making it easier to install and use across different platforms
  • Simpler API, suitable for basic PDF operations and manipulation
  • Actively maintained with regular updates and community support

Cons of pypdf

  • Limited functionality compared to PyMuPDF, especially for complex PDF operations
  • Slower performance for large PDF files or intensive operations
  • Less comprehensive documentation and fewer examples available

Code Comparison

PyMuPDF:

import fitz
doc = fitz.open("input.pdf")
page = doc[0]
text = page.get_text()
doc.close()

pypdf:

from pypdf import PdfReader
reader = PdfReader("input.pdf")
page = reader.pages[0]
text = page.extract_text()

Both libraries offer similar basic functionality for reading PDF files and extracting text. However, PyMuPDF provides more advanced features and better performance for complex operations, while pypdf offers a simpler API and easier installation process. The choice between the two depends on the specific requirements of your project and the complexity of PDF operations needed.

Small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.

Pros of pdfarranger

  • User-friendly GUI for PDF manipulation tasks
  • Focused on simple operations like reordering, rotating, and merging pages
  • Lightweight and easy to install for end-users

Cons of pdfarranger

  • Limited functionality compared to PyMuPDF's extensive PDF manipulation capabilities
  • Less suitable for programmatic or automated PDF processing tasks
  • Lacks advanced features like text extraction, annotation handling, or PDF creation from scratch

Code Comparison

PyMuPDF:

import fitz
doc = fitz.open("input.pdf")
page = doc[0]
text = page.get_text()
doc.save("output.pdf")

pdfarranger:

# pdfarranger is primarily a GUI application
# It doesn't have a direct Python API for scripting
# Users interact with the application through its graphical interface

PyMuPDF offers a comprehensive Python API for PDF manipulation, making it suitable for both scripting and integration into larger applications. pdfarranger, on the other hand, is designed as a standalone GUI application for simple PDF editing tasks, making it more accessible for non-programmers but less flexible for complex operations or automation.
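The kind of page rearranging that pdfarranger offers interactively can be scripted with PyMuPDF's `Document.select` and `Page.set_rotation`. The sketch below is illustrative only; `reorder_and_rotate` and `rearrange_file` are hypothetical helper names, and the file names in the usage note are placeholders.

```python
def reorder_and_rotate(doc, order, angle=None):
    """Keep only the pages listed in `order` (0-based), in that sequence.

    If `angle` is given, set every remaining page's rotation to that value
    in degrees (a multiple of 90). Works on any PyMuPDF-like document.
    """
    doc.select(order)  # reorder or drop pages in place
    if angle is not None:
        for page in doc:
            page.set_rotation(angle)  # absolute rotation, not relative
    return doc


def rearrange_file(src, dst, order, angle=None):
    """Apply reorder_and_rotate to the file `src` and save it as `dst`."""
    import pymupdf  # imported lazily so reorder_and_rotate works without it

    with pymupdf.open(src) as doc:
        reorder_and_rotate(doc, order, angle)
        doc.save(dst)
```

For example, `rearrange_file("input.pdf", "output.pdf", [2, 0], angle=90)` would keep the third and first pages, in that order, and set both to a 90-degree rotation.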

README

PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Community

Join us on Discord: #pymupdf

Installation

PyMuPDF requires Python 3.8 or later. Install it using pip:

pip install PyMuPDF

There are no mandatory external dependencies. However, some optional features become available only if additional packages are installed.

You can also try without installing by visiting PyMuPDF.io.

Usage

Basic usage is as follows:

import pymupdf # imports the pymupdf library
doc = pymupdf.open("example.pdf") # open a document
for page in doc: # iterate the document pages
  text = page.get_text() # get plain text encoded as UTF-8

Documentation

Full documentation can be found on pymupdf.readthedocs.io.

Optional Features

  • fontTools for creating font subsets.
  • pymupdf-fonts contains some nice fonts for your text output.
  • Tesseract-OCR for optical character recognition in images and document pages.

About

PyMuPDF adds Python bindings and abstractions to MuPDF, a lightweight PDF, XPS, and eBook viewer, renderer, and toolkit. Both PyMuPDF and MuPDF are maintained and developed by Artifex Software, Inc.

PyMuPDF was originally written by Jorj X. McKie.

License and Copyright

PyMuPDF is available under open-source AGPL and commercial license agreements. If you determine you cannot meet the requirements of the AGPL, please contact Artifex for more information regarding a commercial license.