pdfminer.six

Community maintained fork of pdfminer - we fathom PDF

Top Related Projects

PyMuPDF (5,027)

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

pypdf (8,045)

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

pikepdf (2,125)

A Python library for reading and writing PDF, powered by QPDF

Quick Overview

PDFMiner.six is a Python library for extracting text, images, and other information from PDF files. It is a community-maintained fork of the original PDFMiner project, and current releases run on Python 3 (the README asks for Python 3.8 or newer). The library provides tools for parsing PDF documents, analyzing their structure, and extracting content.

Pros

  • Written entirely in Python; runs on Python 3.8 or newer
  • Offers detailed control over PDF parsing and extraction
  • Can extract various elements including text, images, and metadata
  • Actively maintained with regular updates

Cons

  • Slower performance compared to some other PDF libraries
  • Steeper learning curve for beginners
  • Documentation could be more comprehensive
  • May require additional processing for complex PDF layouts

Code Examples

  1. Extracting text from a PDF:
from pdfminer.high_level import extract_text

text = extract_text('sample.pdf')
print(text)
  2. Converting PDF to HTML:
from pdfminer.high_level import extract_text_to_fp
from io import StringIO

output_string = StringIO()
with open('sample.pdf', 'rb') as fin:
    extract_text_to_fp(fin, output_string, laparams=None, output_type='html', codec=None)
print(output_string.getvalue())
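  3. Finding images in a PDF:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTContainer, LTImage

def iter_images(element):
    # Images are usually nested inside LTFigure containers, so recurse.
    if isinstance(element, LTImage):
        yield element
    elif isinstance(element, LTContainer):
        for child in element:
            yield from iter_images(child)

for page_layout in extract_pages("sample.pdf"):
    for image in iter_images(page_layout):
        print(f"Found image on page {page_layout.pageid}: {image}")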

Getting Started

To get started with PDFMiner.six, follow these steps:

  1. Install the library using pip:

    pip install pdfminer.six
    
  2. Import the necessary modules in your Python script:

    from pdfminer.high_level import extract_text, extract_pages
    from pdfminer.layout import LAParams
    
  3. Use the library to extract text from a PDF:

    text = extract_text('path/to/your/pdf/file.pdf', laparams=LAParams())
    print(text)
    

This will extract the text content from the specified PDF file and print it to the console. You can further customize the extraction process by adjusting the LAParams or using other functions provided by the library.

Competitor Comparisons

PyMuPDF (5,027)

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Pros of PyMuPDF

  • Faster performance, especially for large PDF files
  • More comprehensive PDF manipulation capabilities (e.g., editing, merging)
  • Better support for complex PDF structures and annotations

Cons of PyMuPDF

  • Larger library size and more dependencies
  • Steeper learning curve due to more extensive API
  • Dual-licensed under the AGPL and a commercial license; use in closed-source products requires the commercial license

Code Comparison

PyMuPDF:

import fitz
doc = fitz.open("example.pdf")
page = doc[0]
text = page.get_text()

pdfminer.six:

from pdfminer.high_level import extract_text
text = extract_text("example.pdf")

PyMuPDF offers more granular control over PDF processing, while pdfminer.six provides a simpler API for basic text extraction. PyMuPDF's approach allows for more advanced operations, but pdfminer.six's simplicity can be advantageous for straightforward tasks.

Both libraries have their strengths, and the choice between them depends on the specific requirements of your project, such as performance needs, desired features, and licensing considerations.

pypdf (8,045)

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Pros of pypdf

  • Simpler API and easier to use for basic PDF operations
  • Faster performance for common tasks like reading and writing PDFs
  • Better documentation and more active community support

Cons of pypdf

  • Less powerful for complex PDF parsing and analysis
  • Limited support for extracting text with precise formatting
  • Fewer advanced features for handling PDF structure and metadata

Code Comparison

pypdf:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
page = reader.pages[0]
text = page.extract_text()

pdfminer.six:

from pdfminer.high_level import extract_text

text = extract_text("example.pdf")

While both libraries can extract text from PDFs, pypdf offers a more object-oriented approach with separate reader and page objects. pdfminer.six provides a higher-level function for direct text extraction, which can be more convenient for simple use cases but less flexible for complex operations.

pikepdf (2,125)

A Python library for reading and writing PDF, powered by QPDF

Pros of pikepdf

  • High-performance PDF manipulation using C++ library QPDF
  • Supports creation, editing, and merging of PDF files
  • Provides low-level access to PDF structures

Cons of pikepdf

  • Steeper learning curve for complex PDF operations
  • No built-in text extraction; pdfminer.six is the better fit for that task
  • Requires compilation of C++ extensions

Code Comparison

pdfminer.six:

from pdfminer.high_level import extract_text

text = extract_text('document.pdf')
print(text)

pikepdf:

import pikepdf

# pikepdf has no text-extraction method; it works on pages and objects instead.
with pikepdf.Pdf.open('document.pdf') as pdf:
    print(f"{len(pdf.pages)} pages")
    pdf.save('document-copy.pdf')

Summary

pdfminer.six excels in text extraction and analysis, making it ideal for content-focused tasks. pikepdf, built on QPDF, offers robust PDF manipulation capabilities, including creation and editing. While pdfminer.six provides easier text extraction, pikepdf offers more comprehensive PDF handling at the cost of a steeper learning curve. The choice between them depends on whether the primary focus is on text extraction (pdfminer.six) or general PDF manipulation (pikepdf).

README

pdfminer.six

We fathom PDF

Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents, with a focus on getting and analyzing text data. Pdfminer.six extracts the text of a page directly from the source code of the PDF. It can also be used to get the exact location, font or color of the text.

It is built in a modular way such that each component of pdfminer.six can be replaced easily. You can implement your own interpreter or rendering device that uses the power of pdfminer.six for other purposes than text analysis.

Check out the full documentation on Read the Docs.

Features

  • Written entirely in Python.
  • Parse, analyze, and convert PDF documents.
  • Extract content as text, images, html or hOCR.
  • PDF-1.7 specification support (well, almost).
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Support for extracting images (JPG, JBIG2, Bitmaps).
  • Support for various compressions (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode)
  • Support for RC4 and AES encryption.
  • Support for AcroForm interactive form extraction.
  • Table of contents extraction.
  • Tagged contents extraction.
  • Automatic layout analysis.

How to use

  • Install Python 3.8 or newer.

  • Install pdfminer.six.

    pip install pdfminer.six
    
    
  • (Optionally) install extra dependencies for extracting images.

    pip install 'pdfminer.six[image]'
    
    
  • Use the command-line interface to extract text from pdf.

    pdf2txt.py example.pdf
    
    
  • Or use it with Python.

    from pdfminer.high_level import extract_text
    
    text = extract_text("example.pdf")
    print(text)
    

Contributing

Be sure to read the contribution guidelines.

Acknowledgement

This repository includes code from pyHanko; the original license has been included here.