pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.


Top Related Projects

pdfminer.six

6,549

Community maintained fork of pdfminer - we fathom PDF

pdfminer

5,293

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

PyMuPDF

7,041

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

pypdf

9,187

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

pikepdf

2,329

A Python library for reading and writing PDF, powered by QPDF

Quick Overview

PDFParser is a PHP library designed to extract data from PDF files. It provides a set of tools to parse PDF documents, extract text, images, and metadata, making it easier to work with PDF content programmatically.

Pros

Easy to use and integrate into existing PHP projects
Supports extraction of text, images, and metadata from PDF files
Actively maintained with regular updates and bug fixes
Provides a flexible API for customizing extraction behavior

Cons

Limited support for complex PDF layouts or heavily formatted documents
Secured (encrypted or password-protected) PDFs and form-data extraction are not supported, according to the project README
Performance can be slow for large or complex PDF files
Embedded images are returned as raw stream content; converting them to a standard image format may require an additional PHP extension such as Imagick or GD
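
Because some documents cannot be parsed at all, it can help to isolate the parsing call. The following is a minimal sketch, assuming that a failed parse (for example, on a secured document) surfaces as an exception. `extractOrNull` is a hypothetical helper name, not part of the library.

```php
<?php
/**
 * Run an extraction callback and return null instead of letting it throw.
 * Hypothetical helper for illustration; not part of smalot/pdfparser.
 */
function extractOrNull(callable $extract): ?string
{
    try {
        return $extract();
    } catch (\Throwable $e) {
        // Log the reason and let the caller decide how to proceed.
        error_log('PDF extraction failed: ' . $e->getMessage());
        return null;
    }
}

// Usage with the library (assumes Composer autoloading is already set up):
// $text = extractOrNull(function () {
//     $parser = new \Smalot\PdfParser\Parser();
//     return $parser->parseFile('document.pdf')->getText();
// });
```

Returning null keeps batch jobs running when a single file cannot be read; whether to skip, retry, or abort is left to the caller.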

Code Examples

Extracting text from a PDF:

<?php
// Include the autoloader
require_once('vendor/autoload.php');

// Parse PDF file and build necessary objects
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');

// Extract text from PDF
$text = $pdf->getText();
echo $text;

Extracting metadata from a PDF:

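<?php
// Include the autoloader
require_once('vendor/autoload.php');

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');

// Get PDF metadata as an associative array
$metadata = $pdf->getDetails();

// Print specific metadata fields; not every PDF defines all of them,
// so fall back to a placeholder instead of triggering an undefined-index notice
echo "Title: " . ($metadata['Title'] ?? 'n/a') . "\n";
echo "Author: " . ($metadata['Author'] ?? 'n/a') . "\n";
echo "Creation Date: " . ($metadata['CreationDate'] ?? 'n/a') . "\n";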
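
The set of metadata fields varies from file to file, and a field may hold several values. As a rough sketch, the whole array returned by `getDetails()` can be printed without knowing the keys in advance. `formatDetails` is a hypothetical helper name, and the sketch assumes each value is either a scalar or a flat array.

```php
<?php
/**
 * Turn a metadata array into "key => value" lines.
 * Hypothetical helper; assumes values are scalars or flat arrays.
 */
function formatDetails(array $details): string
{
    $lines = '';
    foreach ($details as $property => $value) {
        // Multi-valued fields are joined into one comma-separated string.
        if (is_array($value)) {
            $value = implode(', ', $value);
        }
        $lines .= $property . ' => ' . $value . "\n";
    }
    return $lines;
}

// Usage with the library (assumes Composer autoloading is already set up):
// echo formatDetails($pdf->getDetails());
```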

Extracting images from a PDF:

<?php
require_once('vendor/autoload.php');

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');

// Collect image XObjects from the whole document.
// (Page::getImages() does not appear to be part of the library's API.)
$images = $pdf->getObjectsByType('XObject', 'Image');

foreach ($images as $image) {
    $content = $image->getContent();
    // The stream is not guaranteed to be PNG data, so use a neutral extension.
    file_put_contents('image_' . md5($content) . '.bin', $content);
}
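
The bytes returned for an embedded image are not guaranteed to be PNG data, so the file extension is better chosen from the content than hard-coded. A minimal sketch, assuming the stream starts with a standard file signature when it is a complete JPEG or PNG file; `guessImageExtension` is a hypothetical helper, not a library function.

```php
<?php
/**
 * Guess a file extension from the leading bytes of an image stream.
 * Hypothetical helper; only JPEG and PNG signatures are recognised.
 */
function guessImageExtension(string $bytes): string
{
    // JPEG files start with FF D8 FF.
    if (strncmp($bytes, "\xFF\xD8\xFF", 3) === 0) {
        return 'jpg';
    }
    // PNG files start with the 8-byte signature 89 50 4E 47 0D 0A 1A 0A.
    if (strncmp($bytes, "\x89PNG\r\n\x1A\n", 8) === 0) {
        return 'png';
    }
    // Unknown or raw pixel data: keep a neutral extension.
    return 'bin';
}

// Usage:
// $name = 'image_' . md5($content) . '.' . guessImageExtension($content);
```

Streams that hold raw pixel data carry no signature and fall through to `bin`; they would need to be re-encoded with an image library before they can be viewed.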

Getting Started

To use PDFParser in your PHP project, follow these steps:

Install the library using Composer:
```
composer require smalot/pdfparser
```
Include the autoloader in your PHP script:
```
require_once('vendor/autoload.php');
```

Create a Parser object and start extracting data from your PDF files:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('path/to/your/document.pdf');
$text = $pdf->getText();

Now you're ready to use PDFParser to extract data from PDF files in your PHP applications.

Competitor Comparisons

pdfminer.six

6,549

Community maintained fork of pdfminer - we fathom PDF

Pros of pdfminer.six

More comprehensive PDF parsing capabilities, including support for complex layouts and text extraction with positioning
Better handling of various PDF encodings and character sets
Actively maintained with regular updates and improvements

Cons of pdfminer.six

Steeper learning curve due to its more complex API
Slower performance compared to pdfparser, especially for large PDF files
Requires more setup and configuration for basic use cases

Code Comparison

pdfminer.six:

from pdfminer.high_level import extract_text_to_fp
from io import StringIO

output_string = StringIO()
with open('example.pdf', 'rb') as fin:
    extract_text_to_fp(fin, output_string)
text = output_string.getvalue().strip()

pdfparser:

<?php
require_once('vendor/autoload.php');

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('example.pdf');
$text = $pdf->getText();

Both libraries offer PDF parsing capabilities, but pdfminer.six provides more advanced features at the cost of complexity, while pdfparser offers a simpler API for basic text extraction. The choice between them depends on the specific requirements of your project, such as the complexity of the PDFs you're working with and the level of detail needed in the extracted information.

pdfminer

5,293

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Pros of pdfminer

More comprehensive PDF parsing capabilities, including support for complex layouts and text extraction with positional information
Better handling of non-Latin character sets and encodings
Offers both a command-line interface and a Python API for flexibility

Cons of pdfminer

Generally slower performance compared to pdfparser
Less actively maintained, with fewer recent updates
Steeper learning curve for beginners due to its more complex API

Code Comparison

pdfminer:

# Assumes the high-level API as packaged with pdfminer.six; the original pdfminer may not ship it
from pdfminer.high_level import extract_text

text = extract_text('document.pdf')
print(text)

pdfparser:

<?php
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');
$text = $pdf->getText();
echo $text;

For basic text extraction, both libraries need only a few lines, and the main difference is the language: pdfminer targets Python, pdfparser targets PHP. pdfminer also exposes layout and positional information for finer control over extraction, at the cost of more setup for advanced use cases. pdfparser is the easier fit for existing PHP projects.

PyMuPDF

7,041

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Pros of PyMuPDF

Faster performance for large PDF files
More comprehensive feature set, including PDF editing and creation
Better support for complex PDF structures and annotations

Cons of PyMuPDF

Steeper learning curve due to more complex API
Larger library size and more dependencies

Code Comparison

PyMuPDF:

import fitz
doc = fitz.open("example.pdf")
for page in doc:
    text = page.get_text()
    print(text)

PDFParser:

<?php
require_once('vendor/autoload.php');

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('example.pdf');

// Print the text of each page in turn
foreach ($pdf->getPages() as $page) {
    echo $page->getText();
}

Both libraries extract text page by page in a handful of lines. The practical difference is the language and scope: PyMuPDF is a Python library that also covers rendering, editing, and creation, while PDFParser is a PHP library focused on reading data out of existing PDFs.

Both libraries are actively maintained and have their strengths. PyMuPDF is generally better suited for complex PDF manipulation tasks and high-performance requirements, while PDFParser may be preferable for simpler text extraction needs or when a lighter-weight solution is desired.

pypdf

9,187

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Pros of pypdf

Written in pure Python, making it more portable and easier to install
Supports both reading and writing PDF files
Actively maintained with regular updates and improvements

Cons of pypdf

May be slower for large PDF files compared to pdfparser
Less comprehensive text extraction capabilities
Limited support for complex PDF structures and annotations

Code Comparison

pypdf:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
page = reader.pages[0]
text = page.extract_text()
print(text)

pdfparser:

<?php
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('example.pdf');
$text = $pdf->getText();
echo $text;
?>

Both libraries offer straightforward methods for extracting text from PDF files. pypdf extracts text page by page through a reader object, while pdfparser's getText() returns the text of the entire document in a single call. Beyond that, the main difference is the programming language: pypdf is Python-based and pdfparser is PHP-based.

pikepdf

2,329

A Python library for reading and writing PDF, powered by QPDF

Pros of pikepdf

Faster performance, especially for large PDF files
More comprehensive PDF manipulation capabilities, including encryption and digital signatures
Better support for PDF/A and other PDF standards

Cons of pikepdf

Steeper learning curve due to more complex API
Building from source requires a C++ compiler and the QPDF library, which can be challenging for some users (prebuilt wheels are available for common platforms)
No built-in text extraction, which is PDFParser's main focus

Code Comparison

PDFParser:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');
$text = $pdf->getText();

pikepdf:

import pikepdf

# pikepdf targets structural PDF manipulation and has no built-in text extraction,
# so this example inspects the document instead.
with pikepdf.open('document.pdf') as pdf:
    print(len(pdf.pages))  # number of pages
    print(pdf.docinfo)     # document information dictionary

The two libraries address different tasks. PDFParser is the more direct choice for extracting text in PHP, while pikepdf focuses on reading, modifying, and writing PDF structure in Python and needs a separate library for text extraction. The choice between the two depends on whether your project needs text extraction or PDF manipulation.


README

PDF parser

The smalot/pdfparser is a standalone PHP package that provides various tools to extract data from PDF files.

This library is under active maintenance. There is no active development by the author of this library (at the moment), but we welcome any pull request adding/extending functionality! See CONTRIBUTING.md for further information about how to contribute.

Features

Load/parse objects and headers
Extract metadata (author, description, ...)
Extract text from ordered pages
Support of compressed PDFs
Support of Mac OS Roman charset encoding
Handling of hexadecimal and octal encoding in text sections
Create custom configurations (see CustomConfig.md).

Currently, secured documents and extracting form data are not supported.
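
As an illustration of the custom configuration feature listed above, the sketch below passes a configuration object to the parser. The class and method names (`Config`, `setRetainImageContent`) and the constructor arguments are assumptions recalled from the project's CustomConfig documentation; check them against the installed version before relying on them.

```php
<?php
require_once 'vendor/autoload.php';

// Assumed API: a Config object passed as the second Parser constructor argument.
$config = new \Smalot\PdfParser\Config();

// When only text is needed, skip keeping image data in memory.
$config->setRetainImageContent(false);

$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile('document.pdf');

echo $pdf->getText();
```

Dropping image content is mainly useful for large, image-heavy files where memory use matters more than access to the embedded images.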

License

This library is under the LGPLv3 license.

Install

This library requires PHP 7.1+ since v1. You can install it via Composer:

composer require smalot/pdfparser

In case you can't use Composer, you can include alt_autoload.php-dist. It will include all required files automatically.

Quick example

<?php

// Parse PDF file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('/path/to/document.pdf');

$text = $pdf->getText();
echo $text;

Further usage information can be found in the project's documentation.

Documentation

Documentation can be found in the doc folder.
