smalot/pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.

Top Related Projects

  • pdfminer.six: Community maintained fork of pdfminer - we fathom PDF
  • pdfminer: Python PDF Parser (Not actively maintained). Check out pdfminer.six.
  • PyMuPDF: A high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
  • pypdf: A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
  • pikepdf: A Python library for reading and writing PDF, powered by QPDF

Quick Overview

PdfParser is a standalone PHP library for extracting data from PDF files. It parses PDF documents and exposes their text, images, and metadata, so PDF content can be processed programmatically.

Pros

  • Easy to use and integrate into existing PHP projects
  • Supports extraction of text, images, and metadata from PDF files
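  • Under active maintenance (bug fixes and pull requests are accepted), although the README notes that the original author is not currently developing new features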
  • Provides a flexible API for customizing extraction behavior

Cons

  • Limited support for complex PDF layouts or heavily formatted documents
  • Secured (encrypted or password-protected) documents and form data extraction are currently not supported, according to the README
  • Performance can be slow for large or complex PDF files
  • Requires PHP extension for image extraction (Imagick or GD)

Code Examples

  1. Extracting text from a PDF:
<?php
// Include the autoloader
require_once('vendor/autoload.php');

// Parse PDF file and build necessary objects
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');

// Extract text from PDF
$text = $pdf->getText();
echo $text;
  2. Extracting metadata from a PDF:
<?php
require_once('vendor/autoload.php');

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');

// Get PDF metadata as an associative array
$metadata = $pdf->getDetails();

// Print specific metadata fields; a PDF may omit any of them
echo "Title: " . ($metadata['Title'] ?? 'n/a') . "\n";
echo "Author: " . ($metadata['Author'] ?? 'n/a') . "\n";
echo "Creation Date: " . ($metadata['CreationDate'] ?? 'n/a') . "\n";
  3. Extracting images from a PDF:
<?php
require_once('vendor/autoload.php');

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');

// Image XObjects are retrieved from the document rather than from each page
$images = $pdf->getObjectsByType('XObject', 'Image');

foreach ($images as $image) {
    // getContent() returns the raw image stream; its format depends on the PDF,
    // so a neutral extension is used instead of assuming PNG
    $content = $image->getContent();
    file_put_contents('image_' . md5($content) . '.bin', $content);
}
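
The metadata example above reads fixed keys, but a PDF may omit a field and some values can come back as arrays. The following is a minimal sketch of a more defensive formatter. The function name `formatDetails` is hypothetical (it is not part of PdfParser), and it works on a plain array such as the one returned by `getDetails()`.

```php
<?php
// Hypothetical helper (not part of PdfParser): render selected metadata
// fields from an array such as the one returned by $pdf->getDetails().
function formatDetails(array $details, array $fields): string
{
    $lines = [];
    foreach ($fields as $field) {
        // A PDF may omit any field, so fall back to a placeholder.
        $value = $details[$field] ?? 'n/a';
        // Some values can be arrays; flatten them for display.
        if (is_array($value)) {
            $value = implode(', ', $value);
        }
        $lines[] = $field . ': ' . $value;
    }
    return implode("\n", $lines);
}

// Usage with a parsed document (assumes Composer's autoloader is loaded):
// $pdf = (new \Smalot\PdfParser\Parser())->parseFile('document.pdf');
// echo formatDetails($pdf->getDetails(), ['Title', 'Author', 'CreationDate']);
```

Keeping the formatting separate from the parsing step makes it possible to exercise the helper with plain arrays, without a PDF file at hand.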

Getting Started

To use PDFParser in your PHP project, follow these steps:

  1. Install the library using Composer:

    composer require smalot/pdfparser
    
  2. Include the autoloader in your PHP script:

    require_once('vendor/autoload.php');
    
  3. Create a Parser object and start extracting data from your PDF files:

    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile('path/to/your/document.pdf');
    $text = $pdf->getText();
    

Now you're ready to use PDFParser to extract data from PDF files in your PHP applications.
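
When text is needed page by page rather than as one string, the pages can be iterated individually. The sketch below assumes that `getPages()` returns page objects that each provide a `getText()` method; the helper name `textByPage` is hypothetical.

```php
<?php
// Hypothetical helper: collect the text of each page, keyed by 1-based page number.
// It accepts any iterable of objects that provide a getText() method.
function textByPage(iterable $pages): array
{
    $result = [];
    $number = 1;
    foreach ($pages as $page) {
        $result[$number++] = $page->getText();
    }
    return $result;
}

// Usage (assumes Composer's autoloader is loaded):
// $pdf = (new \Smalot\PdfParser\Parser())->parseFile('path/to/your/document.pdf');
// foreach (textByPage($pdf->getPages()) as $number => $text) {
//     echo "--- Page $number ---\n$text\n";
// }
```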

Competitor Comparisons

Community maintained fork of pdfminer - we fathom PDF

Pros of pdfminer.six

  • More comprehensive PDF parsing capabilities, including support for complex layouts and text extraction with positioning
  • Better handling of various PDF encodings and character sets
  • Actively maintained with regular updates and improvements

Cons of pdfminer.six

  • Steeper learning curve due to its more complex API
  • Slower performance compared to pdfparser, especially for large PDF files
  • Requires more setup and configuration for basic use cases

Code Comparison

pdfminer.six:

from pdfminer.high_level import extract_text_to_fp
from io import StringIO

output_string = StringIO()
with open('example.pdf', 'rb') as fin:
    extract_text_to_fp(fin, output_string)
text = output_string.getvalue().strip()

pdfparser:

<?php
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('example.pdf');
$text = $pdf->getText();

Both libraries offer PDF parsing capabilities, but pdfminer.six provides more advanced features at the cost of complexity, while pdfparser offers a simpler API for basic text extraction. The choice between them depends on the specific requirements of your project, such as the complexity of the PDFs you're working with and the level of detail needed in the extracted information.

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Pros of pdfminer

  • More comprehensive PDF parsing capabilities, including support for complex layouts and text extraction with positional information
  • Better handling of non-Latin character sets and encodings
  • Offers both a command-line interface and a Python API for flexibility

Cons of pdfminer

  • Generally slower performance compared to pdfparser
  • Less actively maintained, with fewer recent updates
  • Steeper learning curve for beginners due to its more complex API

Code Comparison

pdfminer:

from pdfminer.high_level import extract_text

text = extract_text('document.pdf')
print(text)

pdfparser:

<?php
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');
$text = $pdf->getText();
echo $text;

pdfminer offers a more straightforward approach for basic text extraction in Python, while pdfparser provides a simple PHP interface for parsing PDF files. pdfminer's Python API allows for more granular control over the extraction process, but requires more setup for advanced use cases. pdfparser's PHP implementation is more concise and easier to integrate into existing PHP projects.

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Pros of PyMuPDF

  • Faster performance for large PDF files
  • More comprehensive feature set, including PDF editing and creation
  • Better support for complex PDF structures and annotations

Cons of PyMuPDF

  • Steeper learning curve due to more complex API
  • Larger library size and more dependencies

Code Comparison

PyMuPDF:

import fitz
doc = fitz.open("example.pdf")
for page in doc:
    text = page.get_text()
    print(text)

PDFParser:

<?php
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('example.pdf');
foreach ($pdf->getPages() as $page) {
    echo $page->getText();
}

Both libraries can extract text page by page in a few lines. The main difference in these snippets is the language: PyMuPDF targets Python, while PDFParser is a standalone PHP library.

Both libraries are actively maintained and have their strengths. PyMuPDF is generally better suited for complex PDF manipulation tasks and high-performance requirements, while PDFParser may be preferable for simpler text extraction needs or when a lighter-weight solution is desired.

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Pros of pypdf

  • Written in pure Python, making it more portable and easier to install
  • Supports both reading and writing PDF files
  • Actively maintained with regular updates and improvements

Cons of pypdf

  • May be slower for large PDF files compared to pdfparser
  • Less comprehensive text extraction capabilities
  • Limited support for complex PDF structures and annotations

Code Comparison

pypdf:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
page = reader.pages[0]
text = page.extract_text()
print(text)

pdfparser:

<?php
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('example.pdf');
$text = $pdf->getText();
echo $text;
?>

Both libraries offer straightforward methods for extracting text from PDF files. pypdf uses a more object-oriented approach with separate reader and page objects, while pdfparser provides a more direct method to extract text from the entire PDF. The main difference lies in the programming language and syntax, with pypdf being Python-based and pdfparser being PHP-based.

A Python library for reading and writing PDF, powered by QPDF

Pros of pikepdf

  • Faster performance, especially for large PDF files
  • More comprehensive PDF manipulation capabilities, including encryption and digital signatures
  • Better support for PDF/A and other PDF standards

Cons of pikepdf

  • Steeper learning curve due to more complex API
  • Installing from source requires a C++ compiler and the QPDF library, which can be challenging for some users
  • Not a text extraction library: it targets PDF structure and manipulation, whereas PDFParser focuses on extracting data

Code Comparison

PDFParser:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');
$text = $pdf->getText();

pikepdf:

import pikepdf

# pikepdf does not extract text; it exposes the document's pages and objects
with pikepdf.Pdf.open('document.pdf') as pdf:
    page_count = len(pdf.pages)
    print(page_count)

Both libraries offer PDF parsing capabilities, but pikepdf provides more advanced features for PDF manipulation. PDFParser is more straightforward for simple text extraction tasks, while pikepdf offers greater flexibility and performance for complex PDF operations. The choice between the two depends on the specific requirements of your project and the level of PDF manipulation needed.

README

PDF parser

The smalot/pdfparser is a standalone PHP package that provides various tools to extract data from PDF files.

This library is under active maintenance. There is no active development by the author of this library (at the moment), but we welcome any pull request adding/extending functionality! See CONTRIBUTING.md for further information about how to contribute.

Features

  • Load/parse objects and headers
  • Extract metadata (author, description, ...)
  • Extract text from ordered pages
  • Support of compressed PDFs
  • Support of Mac OS Roman charset encoding
  • Handling of hexadecimal and octal encoding in text sections
  • Create custom configurations (see CustomConfig.md).
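
As an illustration of the custom configuration feature, the sketch below builds a parser from a `Config` object. It assumes the `\Smalot\PdfParser\Config` class and its `setFontSpaceLimit()` setter behave as described in CustomConfig.md, which remains the authoritative reference; the helper name `makeParser` and the value used in the usage comment are hypothetical.

```php
<?php
// Sketch only: build a parser with a custom configuration.
// makeParser() is a hypothetical helper, not part of the library.
function makeParser(int $fontSpaceLimit)
{
    $config = new \Smalot\PdfParser\Config();
    // Illustrative setting; see CustomConfig.md for the available options.
    $config->setFontSpaceLimit($fontSpaceLimit);

    // The Config object is passed as the second constructor argument.
    return new \Smalot\PdfParser\Parser([], $config);
}

// Usage (assumes Composer's autoloader is loaded):
// $pdf = makeParser(-60)->parseFile('document.pdf');
// echo $pdf->getText();
```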

Currently, secured documents and extracting form data are not supported.
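
Because unsupported files can turn up in real input, it can be useful to guard the parsing call. The sketch below assumes the parser signals failure by throwing an exception; that is an assumption here, since the README only states that secured documents are unsupported. The helper name `extractTextOrNull` is hypothetical.

```php
<?php
// Hypothetical helper: return the document text, or null when parsing fails
// (for example on a secured PDF, which the library does not support).
// $parser is expected to provide parseFile(), as \Smalot\PdfParser\Parser does.
function extractTextOrNull($parser, string $path): ?string
{
    try {
        return $parser->parseFile($path)->getText();
    } catch (\Exception $e) {
        // Log the reason and let the caller decide how to proceed.
        error_log('Could not parse ' . $path . ': ' . $e->getMessage());
        return null;
    }
}

// Usage (assumes Composer's autoloader is loaded):
// $text = extractTextOrNull(new \Smalot\PdfParser\Parser(), 'document.pdf');
// if ($text === null) {
//     echo "This PDF could not be parsed.\n";
// }
```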

License

This library is under the LGPLv3 license.

Install

This library requires PHP 7.1+ since v1. You can install it via Composer:

composer require smalot/pdfparser

In case you can't use Composer, you can include alt_autoload.php-dist. It will include all required files automatically.
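
A project that must run both with and without Composer could pick whichever autoloader is present. This is a minimal sketch: the helper name `findAutoloader` is hypothetical, and both paths are assumptions that depend on where the library was placed.

```php
<?php
// Hypothetical helper: return the first autoloader file that exists, or null.
function findAutoloader(array $candidates): ?string
{
    foreach ($candidates as $candidate) {
        if (is_file($candidate)) {
            return $candidate;
        }
    }
    return null;
}

// Both paths are assumptions; adjust them to your project layout.
$autoloader = findAutoloader([
    __DIR__ . '/vendor/autoload.php',              // Composer install
    __DIR__ . '/pdfparser/alt_autoload.php-dist',  // manual copy of the library
]);

if ($autoloader !== null) {
    require_once $autoloader;
}
```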

Quick example

<?php

// Parse PDF file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('/path/to/document.pdf');

$text = $pdf->getText();
echo $text;

Further usage information can be found here.

Documentation

Documentation can be found in the doc folder.