pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

8,045

1,389

Top Related Projects

PyMuPDF

5,027

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

pdfminer.six

5,793

Community maintained fork of pdfminer - we fathom PDF

itext-java

1,968

iText for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with a better document engine, high and low-level programming capabilities and the ability to create, edit and enhance PDF documents, iText can be a boon to nearly every workflow.

pikepdf

2,125

A Python library for reading and writing PDF, powered by QPDF

Quick Overview

pypdf is a free and open-source pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs.
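As a rough illustration of the "custom data" part of the overview, the sketch below copies a PDF and attaches document-information metadata. The helper names `build_info_dict` and `copy_with_metadata` and the file paths are hypothetical; only `PdfReader`, `PdfWriter`, `add_page`, `add_metadata`, and `write` come from pypdf.

```python
from datetime import datetime, timezone


def build_info_dict(title, author, created=None):
    """Map plain values to PDF document-information keys (hypothetical helper)."""
    created = created or datetime.now(timezone.utc)
    return {
        "/Title": title,
        "/Author": author,
        # PDF date strings start with "D:" followed by YYYYMMDDHHmmSS
        "/CreationDate": created.strftime("D:%Y%m%d%H%M%S"),
    }


def copy_with_metadata(src_path, dst_path, info):
    """Copy every page of src_path into dst_path and attach the metadata."""
    # Imported here so build_info_dict stays usable without pypdf installed
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.add_metadata(info)
    writer.write(dst_path)
```

A call such as `copy_with_metadata("input.pdf", "output.pdf", build_info_dict("Report", "Jane Doe"))` would then produce a copy whose metadata can be read back through `PdfReader("output.pdf").metadata`.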

Pros

Pure Python implementation, making it easy to install and use across different platforms
Comprehensive set of features for PDF manipulation and extraction
Active development and community support
Well-documented with extensive examples

Cons

Performance may be slower compared to C-based PDF libraries
Limited support for complex PDF structures or heavily formatted documents
May struggle with some non-standard or malformed PDFs
Text extraction can be imperfect, especially for PDFs with complex layouts
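Because extracted text often arrives with stray whitespace and words split across line breaks, a small post-processing step can help. The following is a minimal sketch, not part of pypdf: `clean_extracted_text` and `extract_clean_pages` are hypothetical helpers, and the cleanup rules are assumptions that may not suit every document.

```python
import re


def clean_extracted_text(text):
    """Tidy raw extracted text (hypothetical helper, not part of pypdf)."""
    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, but keep line breaks
    text = re.sub(r"[ \t]+", " ", text)
    # Strip each line and drop lines that end up empty
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def extract_clean_pages(path):
    """Return one cleaned string per page of the PDF at path."""
    # Imported here so the cleanup helper works without pypdf installed
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [clean_extracted_text(page.extract_text() or "") for page in reader.pages]
```

Note that the hyphen rule also merges genuine compounds that happen to break at a line end (for example "well-" followed by "known"), so it is best treated as a starting point.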

Code Examples

Merging PDFs:

from pypdf import PdfWriter

# PdfMerger was deprecated and removed in pypdf 5.0; PdfWriter offers the same append() method
merger = PdfWriter()
merger.append("file1.pdf")
merger.append("file2.pdf")
merger.write("merged_output.pdf")
merger.close()

Extracting text from a PDF:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
page = reader.pages[0]
text = page.extract_text()
print(text)

Adding a watermark to a PDF:

from pypdf import PdfWriter, PdfReader

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
watermark = PdfReader("watermark.pdf")
page.merge_page(watermark.pages[0])
writer.add_page(page)

writer.write("output.pdf")

Getting Started

To get started with pypdf, first install it using pip:

pip install pypdf

Then, you can import and use it in your Python script:

from pypdf import PdfReader, PdfWriter

# Open a PDF file
reader = PdfReader("input.pdf")

# Get the first page
page = reader.pages[0]

# Create a new PDF with the first page
writer = PdfWriter()
writer.add_page(page)

# Save the new PDF
writer.write("output.pdf")

This example opens a PDF, extracts the first page, and saves it as a new PDF file.
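The same reader/writer pattern extends naturally to splitting a document into several files. The sketch below is one possible approach; `chunk_ranges`, `split_pdf`, and the output naming scheme are hypothetical and not part of pypdf.

```python
def chunk_ranges(page_count, chunk_size):
    """Return (start, stop) page index pairs covering page_count pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count))
        for start in range(0, page_count, chunk_size)
    ]


def split_pdf(src_path, chunk_size, prefix="part"):
    """Write src_path out as several files of at most chunk_size pages each."""
    # Imported here so chunk_ranges stays usable without pypdf installed
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    written = []
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for number, (start, stop) in enumerate(ranges, 1):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        out_path = f"{prefix}_{number}.pdf"
        writer.write(out_path)
        written.append(out_path)
    return written
```

Keeping the range arithmetic in its own function makes the page boundaries easy to check without touching any files.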

Competitor Comparisons

PyMuPDF

5,027

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Pros of PyMuPDF

Faster performance for large PDF operations
More comprehensive feature set, including support for other document formats
Better handling of complex PDF structures and annotations

Cons of PyMuPDF

Larger library size and more dependencies
Steeper learning curve due to more complex API
Licensed under the AGPL; use in closed-source software requires a commercial license

Code Comparison

PyMuPDF example:

import fitz
doc = fitz.open("example.pdf")
page = doc[0]
text = page.get_text()

PyPDF example:

from pypdf import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
text = page.extract_text()

Both libraries offer similar basic functionality for PDF manipulation, but PyMuPDF generally provides more advanced features and better performance for complex operations. PyPDF, on the other hand, is lighter-weight and easier to get started with, making it suitable for simpler PDF tasks. The choice between the two depends on the specific requirements of your project, such as performance needs, feature complexity, and licensing considerations.

pdfminer.six

5,793

Community maintained fork of pdfminer - we fathom PDF

Pros of pdfminer.six

More advanced text extraction capabilities, especially for complex layouts
Better support for extracting images and other embedded objects
More granular control over parsing and extraction processes

Cons of pdfminer.six

Steeper learning curve due to more complex API
Slower performance compared to pypdf, especially for large documents
Less frequent updates and maintenance

Code Comparison

pypdf:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
page = reader.pages[0]
text = page.extract_text()

pdfminer.six:

from pdfminer.high_level import extract_text

text = extract_text("example.pdf")

While both libraries can extract text from PDFs, pdfminer.six offers more advanced options for customizing the extraction process:

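from pdfminer.layout import LAParams
from pdfminer.high_level import extract_text_to_fp

# Default layout parameters; adjust attributes such as line_margin or char_margin to tune the analysis
laparams = LAParams()
# Open both files in one with statement so the input handle is closed as well
with open("example.pdf", "rb") as pdf_file, open("output.txt", "wb") as output_file:
    extract_text_to_fp(pdf_file, output_file, laparams=laparams)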

This example demonstrates pdfminer.six's ability to fine-tune layout analysis parameters and extract text directly to a file, which can be useful for processing large documents or implementing more complex extraction workflows.

itext-java

1,968

Pros of iText

More comprehensive PDF manipulation capabilities
Better performance for large-scale PDF operations
Robust support for digital signatures and encryption

Cons of iText

Commercial licensing required for many use cases
Steeper learning curve due to more complex API
Java-based, which may not be ideal for Python-centric projects

Code Comparison

iText (Java):

PdfDocument pdf = new PdfDocument(new PdfWriter("output.pdf"));
Document document = new Document(pdf);
document.add(new Paragraph("Hello, World!"));
document.close();

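pypdf (Python):

from pypdf import PdfWriter
from pypdf.annotations import FreeText

writer = PdfWriter()
writer.add_blank_page(width=612, height=792)
# pypdf has no API for drawing text into page content; a FreeText annotation is the closest equivalent
annotation = FreeText(text="Hello, World!", rect=(50, 700, 250, 740))
writer.add_annotation(page_number=0, annotation=annotation)
writer.write("output.pdf")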

Summary

iText is a powerful Java-based PDF library with extensive features, suitable for complex PDF manipulations in enterprise environments. It offers better performance and more advanced capabilities but comes with commercial licensing requirements and a steeper learning curve.

pypdf, on the other hand, is a Python library with a simpler API that focuses on manipulating existing PDFs (splitting, merging, transforming) rather than laying out new content. It is open-source and easier to use, making it a good choice for Python developers working on smaller-scale projects or those needing basic PDF functionality.

The choice between the two depends on the specific project requirements, programming language preference, and licensing considerations.

pikepdf

2,125

A Python library for reading and writing PDF, powered by QPDF

Pros of pikepdf

Faster performance, especially for large PDF files
More comprehensive PDF manipulation capabilities
Better support for complex PDF structures and features

Cons of pikepdf

Steeper learning curve due to more complex API
Depends on the C++ QPDF library; prebuilt wheels cover common platforms, but building from source requires a C++ compiler
Less straightforward for simple PDF operations compared to pypdf

Code Comparison

PyPDF:

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

with open("output.pdf", "wb") as f:
    writer.write(f)

pikepdf:

import pikepdf

pdf = pikepdf.Pdf.open("input.pdf")
pdf.save("output.pdf")

Both libraries offer PDF manipulation capabilities, but pikepdf generally provides more advanced features and better performance for complex operations. PyPDF is often simpler to use for basic tasks and has an easier installation process. The choice between the two depends on the specific requirements of your project and the complexity of the PDF operations you need to perform.

README

pypdf

pypdf is a free and open-source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. pypdf can retrieve text and metadata from PDFs as well.

See pdfly for a CLI application that uses pypdf to interact with PDFs.

Installation

Install pypdf using pip:

pip install pypdf

For using pypdf with AES encryption or decryption, install extra dependencies:

pip install pypdf[crypto]
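As a sketch of how the optional dependency might be used, the example below encrypts a copy of a PDF and prefers AES when a crypto backend appears to be available. `pick_algorithm` and `encrypt_pdf` are hypothetical helpers, and detecting the backend by looking for the `cryptography` package is an assumption about how the extra is satisfied.

```python
from importlib.util import find_spec


def pick_algorithm(aes_available):
    """Choose an algorithm name accepted by PdfWriter.encrypt (hypothetical helper)."""
    return "AES-256" if aes_available else "RC4-128"


def encrypt_pdf(src_path, dst_path, password):
    """Write a password-protected copy of src_path to dst_path."""
    # Imported here so pick_algorithm stays usable without pypdf installed
    from pypdf import PdfReader, PdfWriter

    # Assumption: the crypto extra is provided by the "cryptography" package
    aes_available = find_spec("cryptography") is not None

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.encrypt(password, algorithm=pick_algorithm(aes_available))
    writer.write(dst_path)
```

Falling back to RC4 keeps the example runnable without the extra, though AES is the stronger choice whenever it is available.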

NOTE: pypdf 3.1.0 and above include significant improvements compared to previous versions. Please refer to the migration guide for more information.

Usage

from pypdf import PdfReader

reader = PdfReader("example.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()

pypdf can do a lot more, e.g. splitting, merging, reading and creating annotations, decrypting and encrypting, and more. Check out the documentation for additional usage examples!

For questions and answers, visit StackOverflow (tagged with pypdf).

Contributions

Maintaining pypdf is a collaborative effort. You can support the project by writing documentation, helping to narrow down issues, and submitting code. See the CONTRIBUTING.md file for more information.

Q&A

pypdf users range from beginners who want to make their lives easier to experts who developed software before PDF existed. You can contribute to the pypdf community by answering questions on StackOverflow, helping in discussions, and asking users who report issues for MCVEs (code + example PDF!).

Issues

A good bug ticket includes an MCVE: a minimal, complete, verifiable example. For pypdf, this means that you must upload a PDF that causes the bug to occur as well as the code you're executing with all of the output. Use print(pypdf.__version__) to tell us which version you're using.
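To make the version information easy to paste into a ticket, a short script along these lines could be used. It is only a sketch: `environment_report` is a hypothetical helper, and it reads the installed version through the standard library's `importlib.metadata` rather than `pypdf.__version__`, so it also works when pypdf fails to import.

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version


def environment_report(package="pypdf"):
    """Collect the details a maintainer typically needs to reproduce a bug."""
    try:
        package_version = version(package)
    except PackageNotFoundError:
        package_version = "not installed"
    return {
        "package": package,
        "version": package_version,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }


# Print one "key: value" line per entry, ready to paste into an issue
for key, value in environment_report().items():
    print(f"{key}: {value}")
```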

Code

All code contributions are welcome, but smaller ones have a better chance of being included in a timely manner. Adding unit tests for new features or test cases for bugs you've fixed helps us ensure that the Pull Request (PR) is fine.

pypdf includes a test suite which can be executed with pytest:

$ pytest
===================== test session starts =====================
platform linux -- Python 3.6.15, pytest-7.0.1, pluggy-1.0.0
rootdir: /home/moose/GitHub/Martin/pypdf
plugins: cov-3.0.0
collected 233 items

tests/test_basic_features.py ..                         [  0%]
tests/test_constants.py .                               [  1%]
tests/test_filters.py .................x.....           [ 11%]
tests/test_generic.py ................................. [ 25%]
.............                                           [ 30%]
tests/test_javascript.py ..                             [ 31%]
tests/test_merger.py .                                  [ 32%]
tests/test_page.py .........................            [ 42%]
tests/test_pagerange.py ................                [ 49%]
tests/test_papersizes.py ..................             [ 57%]
tests/test_reader.py .................................. [ 72%]
...............                                         [ 78%]
tests/test_utils.py ....................                [ 87%]
tests/test_workflows.py ..........                      [ 91%]
tests/test_writer.py .................                  [ 98%]
tests/test_xmp.py ...                                   [100%]

========== 232 passed, 1 xfailed, 1 warning in 4.52s ==========
