pypdf
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
Top Related Projects
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
pdfminer.six: Community maintained fork of pdfminer - we fathom PDF
iText for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with a better document engine, high and low-level programming capabilities and the ability to create, edit and enhance PDF documents, iText can be a boon to nearly every workflow.
pikepdf: A Python library for reading and writing PDF, powered by QPDF
Quick Overview
PyPDF is a free and open-source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. PyPDF can retrieve text and metadata from PDFs as well.
Pros
- Pure Python implementation, making it easy to install and use across different platforms
- Comprehensive set of features for PDF manipulation and extraction
- Active development and community support
- Well-documented with extensive examples
Cons
- Performance may be slower compared to C-based PDF libraries
- Limited support for complex PDF structures or heavily formatted documents
- May struggle with some non-standard or malformed PDFs
- Text extraction can be imperfect, especially for PDFs with complex layouts
Code Examples
Merging PDFs:
from pypdf import PdfWriter
# PdfMerger is deprecated and was removed in pypdf 5.0; PdfWriter provides append()
writer = PdfWriter()
writer.append("file1.pdf")
writer.append("file2.pdf")
writer.write("merged_output.pdf")
writer.close()
Extracting text from a PDF:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
text = page.extract_text()
print(text)
Adding a watermark to a PDF:
from pypdf import PdfWriter, PdfReader
reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
watermark = PdfReader("watermark.pdf")
page.merge_page(watermark.pages[0])
writer.add_page(page)
writer.write("output.pdf")
Getting Started
To get started with PyPDF, first install it using pip:
pip install pypdf
Then, you can import and use it in your Python script:
from pypdf import PdfReader, PdfWriter
# Open a PDF file
reader = PdfReader("input.pdf")
# Get the first page
page = reader.pages[0]
# Create a new PDF with the first page
writer = PdfWriter()
writer.add_page(page)
# Save the new PDF
writer.write("output.pdf")
This example opens a PDF, extracts the first page, and saves it as a new PDF file.
Competitor Comparisons
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
Pros of PyMuPDF
- Faster performance for large PDF operations
- More comprehensive feature set, including support for other document formats
- Better handling of complex PDF structures and annotations
Cons of PyMuPDF
- Larger library size and more dependencies
- Steeper learning curve due to more complex API
- Dual-licensed under the AGPL and a commercial license, so closed-source use requires buying a license
Code Comparison
PyMuPDF example:
import fitz
doc = fitz.open("example.pdf")
page = doc[0]
text = page.get_text()
PyPDF example:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
text = page.extract_text()
Both libraries offer similar basic functionality for PDF manipulation, but PyMuPDF generally provides more advanced features and better performance for complex operations. PyPDF, on the other hand, is lighter-weight and easier to get started with, making it suitable for simpler PDF tasks. The choice between the two depends on the specific requirements of your project, such as performance needs, feature complexity, and licensing considerations.
pdfminer.six: Community maintained fork of pdfminer - we fathom PDF
Pros of pdfminer.six
- More advanced text extraction capabilities, especially for complex layouts
- Better support for extracting images and other embedded objects
- More granular control over parsing and extraction processes
Cons of pdfminer.six
- Steeper learning curve due to more complex API
- Slower performance compared to pypdf, especially for large documents
- Less frequent updates and maintenance
Code Comparison
pypdf:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
text = page.extract_text()
pdfminer.six:
from pdfminer.high_level import extract_text
text = extract_text("example.pdf")
While both libraries can extract text from PDFs, pdfminer.six offers more advanced options for customizing the extraction process:
from pdfminer.layout import LAParams
from pdfminer.high_level import extract_text_to_fp
laparams = LAParams()
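# Open both files in one with-statement so the input handle is closed as well
with open("example.pdf", "rb") as input_file, open("output.txt", "wb") as output_file:
    extract_text_to_fp(input_file, output_file, laparams=laparams)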
This example demonstrates pdfminer.six's ability to fine-tune layout analysis parameters and extract text directly to a file, which can be useful for processing large documents or implementing more complex extraction workflows.
iText for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with a better document engine, high and low-level programming capabilities and the ability to create, edit and enhance PDF documents, iText can be a boon to nearly every workflow.
Pros of iText
- More comprehensive PDF manipulation capabilities
- Better performance for large-scale PDF operations
- Robust support for digital signatures and encryption
Cons of iText
- Commercial licensing required for many use cases
- Steeper learning curve due to more complex API
- Java-based, which may not be ideal for Python-centric projects
Code Comparison
iText (Java):
PdfDocument pdf = new PdfDocument(new PdfWriter("output.pdf"));
Document document = new Document(pdf);
document.add(new Paragraph("Hello, World!"));
document.close();
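PyPDF (Python):
from pypdf import PdfWriter
from pypdf.annotations import FreeText
writer = PdfWriter()
writer.add_blank_page(width=612, height=792)
# pypdf pages have no insert_text() method; a FreeText annotation is one way to place text
annotation = FreeText(text="Hello, World!", rect=(50, 700, 250, 730))
writer.add_annotation(page_number=0, annotation=annotation)
writer.write("output.pdf")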
Summary
iText is a powerful Java-based PDF library with extensive features, suitable for complex PDF manipulations in enterprise environments. It offers better performance and more advanced capabilities but comes with commercial licensing requirements and a steeper learning curve.
PyPDF, on the other hand, is a Python library that provides a simpler API for basic PDF operations. It's open-source and easier to use, making it a good choice for Python developers working on smaller-scale projects or those needing basic PDF functionality.
The choice between the two depends on the specific project requirements, programming language preference, and licensing considerations.
pikepdf: A Python library for reading and writing PDF, powered by QPDF
Pros of pikepdf
- Faster performance, especially for large PDF files
- More comprehensive PDF manipulation capabilities
- Better support for complex PDF structures and features
Cons of pikepdf
- Steeper learning curve due to more complex API
- Depends on the compiled QPDF C++ library; binary wheels cover common platforms, but building from source requires a C++ compiler
- Less straightforward for simple PDF operations compared to PyPDF
Code Comparison
PyPDF:
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
with open("output.pdf", "wb") as f:
    writer.write(f)
pikepdf:
import pikepdf
pdf = pikepdf.Pdf.open("input.pdf")
pdf.save("output.pdf")
Both libraries offer PDF manipulation capabilities, but pikepdf generally provides more advanced features and better performance for complex operations. PyPDF is often simpler to use for basic tasks and has an easier installation process. The choice between the two depends on the specific requirements of your project and the complexity of the PDF operations you need to perform.
README
pypdf
pypdf is a free and open-source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. pypdf can retrieve text and metadata from PDFs as well.
See pdfly for a CLI application that uses pypdf to interact with PDFs.
Installation
Install pypdf using pip:
pip install pypdf
For using pypdf with AES encryption or decryption, install extra dependencies:
pip install pypdf[crypto]
NOTE: pypdf 3.1.0 and above include significant improvements compared to previous versions. Please refer to the migration guide for more information.
Usage
from pypdf import PdfReader
reader = PdfReader("example.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()
pypdf can do a lot more, e.g. splitting, merging, reading and creating annotations, decrypting and encrypting, and more. Check out the documentation for additional usage examples!
For questions and answers, visit StackOverflow (tagged with pypdf).
Contributions
Maintaining pypdf is a collaborative effort. You can support the project by writing documentation, helping to narrow down issues, and submitting code. See the CONTRIBUTING.md file for more information.
Q&A
The experience pypdf users have covers the whole range, from beginners who want to make their lives easier to experts who developed software before PDF existed. You can contribute to the pypdf community by answering questions on StackOverflow, helping in discussions, and asking users who report issues for MCVEs (code + example PDF!).
Issues
A good bug ticket includes an MCVE - a minimal complete verifiable example. For pypdf, this means that you must upload a PDF that causes the bug to occur as well as the code you're executing with all of the output. Use print(pypdf.__version__) to tell us which version you're using.
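When preparing a report, it can help to gather the version details in one step. The sketch below uses only the standard library, so it also runs when pypdf is missing; `describe_environment` is a hypothetical helper, not part of pypdf.

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version


def describe_environment(package: str = "pypdf") -> dict:
    """Collect version details that are useful to paste into a bug report."""
    try:
        package_version = version(package)
    except PackageNotFoundError:
        package_version = "not installed"
    return {
        "package": package,
        "version": package_version,
        "python": platform.python_version(),
        "platform": sys.platform,
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```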
Code
All code contributions are welcome, but smaller ones have a better chance of being included in a timely manner. Adding unit tests for new features, or test cases for bugs you've fixed, helps us ensure that the Pull Request (PR) is fine.
pypdf includes a test suite which can be executed with pytest:
$ pytest
===================== test session starts =====================
platform linux -- Python 3.6.15, pytest-7.0.1, pluggy-1.0.0
rootdir: /home/moose/GitHub/Martin/pypdf
plugins: cov-3.0.0
collected 233 items
tests/test_basic_features.py .. [ 0%]
tests/test_constants.py . [ 1%]
tests/test_filters.py .................x..... [ 11%]
tests/test_generic.py ................................. [ 25%]
............. [ 30%]
tests/test_javascript.py .. [ 31%]
tests/test_merger.py . [ 32%]
tests/test_page.py ......................... [ 42%]
tests/test_pagerange.py ................ [ 49%]
tests/test_papersizes.py .................. [ 57%]
tests/test_reader.py .................................. [ 72%]
............... [ 78%]
tests/test_utils.py .................... [ 87%]
tests/test_workflows.py .......... [ 91%]
tests/test_writer.py ................. [ 98%]
tests/test_xmp.py ... [100%]
========== 232 passed, 1 xfailed, 1 warning in 4.52s ==========