pdf-to-markdown

A PDF to Markdown converter

1,396

251


Top Related Projects

OCRmyPDF

29,546

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Parsr

5,971

Transforms PDF, Documents and Images into Enriched Structured Data

tabula

7,078

Tabula is a tool for liberating data tables trapped inside PDF files

pdfminer.six

6,549

Community maintained fork of pdfminer - we fathom PDF

pdfminer

5,293

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Quick Overview

pdf-to-markdown is a JavaScript tool that parses PDF files and converts them to Markdown. It aims to preserve the structure and formatting of the original document while producing clean, readable Markdown. It uses Mozilla's pdf.js as the raw parser and is available as a web app at http://pdf2md.morethan.io; work to separate the parsing logic from the frontend is in progress.
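To make "preserving structure" concrete, here is a minimal sketch of one common heuristic: treat the most-used font height as body text and map larger heights to heading levels. The function name `linesToMarkdown`, the `{ text, height }` line shape, and the sample data are assumptions made for illustration; this is not taken from the project's source.

```javascript
// Hypothetical input: text lines already extracted from a PDF, each with
// the font height the parser reported. Not the project's real data model.
function linesToMarkdown(lines) {
  // Count characters per font height; the most-used height is body text.
  const weight = new Map();
  for (const { text, height } of lines) {
    weight.set(height, (weight.get(height) || 0) + text.length);
  }
  let bodyHeight = 0;
  let best = -1;
  for (const [height, chars] of weight) {
    if (chars > best) {
      best = chars;
      bodyHeight = height;
    }
  }

  // Distinct heights larger than body text, largest first: h1, h2, ...
  const headingHeights = [...weight.keys()]
    .filter((h) => h > bodyHeight)
    .sort((a, b) => b - a);

  return lines
    .map(({ text, height }) => {
      const level = headingHeights.indexOf(height) + 1; // 0 means body text
      return level > 0 ? `${'#'.repeat(Math.min(level, 6))} ${text}` : text;
    })
    .join('\n\n');
}

const sample = [
  { text: 'Annual Report', height: 24 },
  { text: 'Summary', height: 16 },
  { text: 'Revenue grew in every region during the year.', height: 10 },
  { text: 'Outlook', height: 16 },
  { text: 'We expect similar growth next year.', height: 10 },
];

console.log(linesToMarkdown(sample));
```

A heuristic like this depends entirely on the font sizes the PDF reports, which is one reason output quality varies with the input document.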

Pros

Preserves document structure and formatting
Runs in the browser, with an online version at http://pdf2md.morethan.io
Handles various PDF elements like tables, images, and lists
Open-source and actively maintained

Cons

Limited support for complex PDF layouts
May struggle with PDFs containing non-standard fonts or encodings
Parsing logic is not yet available separately from the frontend
Output quality can vary depending on the input PDF's complexity

Code Examples

Building and serving the web app locally, using the scripts listed in the project README:

npm install
npm run build
npm run start

Linting and testing before contributing a change:

npm run check

The README documents no command-line converter or programmatic API; the parsing logic is still being separated from the frontend on the modularize branch.

Getting Started

The README describes two ways to use pdf-to-markdown. The quickest is the online version at http://pdf2md.morethan.io. To run it yourself, clone the repository from GitHub and build it with npm:

npm install
npm run build
npm run start

npm run start serves the app from a local server, which the README notes is useful for loading worker.js and cmaps. Alternatively, open build/index.html in your browser after the build.

Competitor Comparisons

OCRmyPDF

29,546

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Pros of OCRmyPDF

Performs OCR on PDF files, making scanned documents searchable
Supports multiple languages and can handle complex layouts
Actively maintained with regular updates and improvements

Cons of OCRmyPDF

Requires more dependencies and system resources
May alter the original PDF layout in some cases
Steeper learning curve for advanced features

Code Comparison

OCRmyPDF:

import ocrmypdf

ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True, language='eng')

pdf-to-markdown (hypothetical call shape; the project README documents no programmatic API):

const pdfToMarkdown = require('pdf-to-markdown');

pdfToMarkdown('input.pdf').then(markdown => console.log(markdown));

Summary

OCRmyPDF is a powerful tool for making scanned PDFs searchable and editable, while pdf-to-markdown focuses on converting PDF content to Markdown format. OCRmyPDF offers more advanced features and language support but requires more setup. pdf-to-markdown is simpler to use but may not handle complex layouts or scanned documents as effectively. The choice between the two depends on the specific use case and whether OCR functionality is needed.

Parsr

5,971

Transforms PDF, Documents and Images into Enriched Structured Data

Pros of Parsr

More comprehensive document processing capabilities, including table extraction and layout analysis
Supports multiple output formats (markdown, JSON, text, etc.)
Actively maintained with regular updates and improvements

Cons of Parsr

More complex setup and usage, requiring additional dependencies
Heavier resource consumption due to its extensive features
Steeper learning curve for basic PDF to markdown conversion

Code Comparison

pdf-to-markdown (hypothetical call shape; the project README documents no programmatic API)

const pdfToMarkdown = require('pdf-to-markdown');

pdfToMarkdown('input.pdf')
  .then(markdown => console.log(markdown));

Parsr

const Parsr = require('parsr');

const config = { ... };
const parsr = new Parsr(config);
parsr.run('input.pdf')
  .then(result => console.log(result.markdown));

While pdf-to-markdown offers a simpler API for basic PDF to markdown conversion, Parsr provides more configuration options and advanced features for document processing. pdf-to-markdown is easier to use for simple tasks, but Parsr offers greater flexibility and functionality for complex document analysis and conversion needs.

tabula

7,078

Tabula is a tool for liberating data tables trapped inside PDF files

Pros of Tabula

Specialized in extracting tables from PDFs
Offers both a GUI and command-line interface
Supports multiple output formats (CSV, TSV, JSON)

Cons of Tabula

Limited to table extraction, not full PDF-to-Markdown conversion
May struggle with complex or poorly formatted PDFs
Requires Java runtime environment

Code Comparison

pdf-to-markdown (hypothetical call shape; the project README documents no programmatic API)

const pdf2md = require('pdf-to-markdown');

pdf2md.convert('input.pdf', 'output.md')
  .then(() => console.log('Conversion complete'))
  .catch(err => console.error('Error:', err));

Tabula

import technology.tabula.CommandLineApp;

public class TabulaExample {
    public static void main(String[] args) {
        CommandLineApp.main(new String[]{"input.pdf", "-o", "output.csv"});
    }
}

Summary

pdf-to-markdown focuses on converting entire PDF documents to Markdown format, while Tabula specializes in extracting tables from PDFs. Tabula offers more flexibility in terms of user interface and output formats for table data, but it's limited to table extraction. pdf-to-markdown provides a more comprehensive solution for converting PDFs to Markdown, including text, images, and basic formatting. The choice between the two depends on the specific use case and whether the primary focus is on table extraction or full document conversion.
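The two tools can complement each other: table rows exported by a table extractor can be rendered as a Markdown table. The sketch below assumes the rows have already been parsed into arrays of strings (for example from a CSV export); `rowsToMarkdownTable` is a hypothetical helper, not part of either project.

```javascript
// Render rows (first row = header) as a pipe-style Markdown table.
// Hypothetical helper for illustration; not part of Tabula or pdf-to-markdown.
function rowsToMarkdownTable(rows) {
  if (rows.length === 0) return '';

  // A literal pipe inside a cell would split the column, so escape it.
  const escapeCell = (cell) => String(cell).replace(/\|/g, '\\|').trim();
  const toLine = (cells) => `| ${cells.map(escapeCell).join(' | ')} |`;

  const [header, ...body] = rows;
  const divider = `| ${header.map(() => '---').join(' | ')} |`;

  return [toLine(header), divider, ...body.map(toLine)].join('\n');
}

const rows = [
  ['Region', 'Revenue'],
  ['North', '120'],
  ['South', '95'],
];

console.log(rowsToMarkdownTable(rows));
```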

pdfminer.six

6,549

Community maintained fork of pdfminer - we fathom PDF

Pros of pdfminer.six

More comprehensive PDF parsing capabilities, including text extraction, layout analysis, and metadata extraction
Actively maintained with regular updates and bug fixes
Supports Python 3 (current releases no longer support Python 2)

Cons of pdfminer.six

Steeper learning curve due to its more complex API and extensive features
May be overkill for simple PDF-to-markdown conversion tasks
Requires additional processing to convert extracted content to markdown format

Code Comparison

pdfminer.six:

from pdfminer.high_level import extract_text

text = extract_text('sample.pdf')
print(text)

pdf-to-markdown:

The project README documents no Python package or programmatic API. Conversion runs in the web app at http://pdf2md.morethan.io, or in a local build started with npm run build and npm run start.

pdfminer.six provides lower-level access to PDF content, requiring additional processing for markdown conversion. pdf-to-markdown is built specifically for PDF-to-markdown conversion and needs no such post-processing, but it offers less flexibility for complex PDF parsing tasks.

pdfminer

5,293

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Pros of pdfminer

More comprehensive PDF parsing capabilities
Supports a wider range of PDF features and structures
Long-established codebase on which the community-maintained fork pdfminer.six is built

Cons of pdfminer

Steeper learning curve due to its complexity
Requires more setup and configuration for basic tasks
May be overkill for simple PDF-to-markdown conversions

Code Comparison

pdfminer:

from pdfminer.high_level import extract_text

text = extract_text('sample.pdf')
print(text)

pdf-to-markdown:

The project README documents no Java library or programmatic API. Conversion runs in the web app at http://pdf2md.morethan.io, or in a local build started with npm run build and npm run start.

Key Differences

pdfminer is a Python library, while pdf-to-markdown is a JavaScript tool built on pdf.js
pdfminer offers more granular control over PDF parsing
pdf-to-markdown is specifically designed for markdown conversion, making it simpler for this specific use case
pdfminer has a larger community and more extensive documentation

Use Cases

Choose pdfminer for complex PDF processing tasks or when working with Python
Opt for pdf-to-markdown when specifically needing PDF-to-markdown conversion in the browser or for simpler projects


README

PDF-To-Markdown Converter

JavaScript tool to parse PDF files and convert them into Markdown format. Online version at http://pdf2md.morethan.io!

Major Changes

2020/2021 Currently separating the parsing logic from the frontend in order to make it separately available.
- Branch modularize
- Find the current version at https://jzillmann.github.io/pdf-to-markdown-staging/
- Help me
Apr 2017 - 0.1: Initial Release

Contribute

Use the issue tracker and/or open pull requests!

Useful Build Commands

npm install: Download all necessary npm packages
npm run lint: Lint the JavaScript files
npm run test: Run tests
npm run check: Lint & test
npm run build: Build the dev version
npm run start: Run the app on a server (useful for loading of worker.js and cmaps)
npm run watch: Continuously build the project
open build/index.html: Open the built project in your default browser
npm run release: Build the production version
npm run deploy: Build the production version & move it to the GitHub Pages folder

Release

Increase version in package.json
npm run deploy
commit & push
tag with
- git tag -a $releaseVersion -m "$releaseVersion release"
- git push --tags
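The first and last release steps share one value, the new version string. A small sketch of computing it from the current package.json version is shown below; `bumpVersion` is a hypothetical helper, and the `pkg` object stands in for the parsed package.json rather than reading the real file.

```javascript
// Compute the next MAJOR.MINOR.PATCH version string.
// Hypothetical helper for illustration; the project README does not define it.
function bumpVersion(version, part = 'patch') {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!match) {
    throw new Error(`Not a plain MAJOR.MINOR.PATCH version: ${version}`);
  }
  let [major, minor, patch] = match.slice(1).map(Number);

  if (part === 'major') {
    major += 1;
    minor = 0;
    patch = 0;
  } else if (part === 'minor') {
    minor += 1;
    patch = 0;
  } else if (part === 'patch') {
    patch += 1;
  } else {
    throw new Error(`Unknown part: ${part}`);
  }
  return `${major}.${minor}.${patch}`;
}

// Stand-in for the parsed contents of package.json (values are made up).
const pkg = { name: 'example-app', version: '0.1.0' };
const releaseVersion = bumpVersion(pkg.version, 'minor');

console.log(releaseVersion);
// The tag command from the release steps, with the version filled in:
console.log(`git tag -a ${releaseVersion} -m "${releaseVersion} release"`);
```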

Credits

pdf.js - Mozilla's PDF parsing & rendering platform which is used as a raw parser
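A raw parser such as pdf.js reports positioned text fragments rather than lines or paragraphs, so a converter has to reassemble them. The sketch below groups fragments into lines by their baseline position. The sample items imitate the shape of pdf.js text items (`str` plus a `transform` array whose fifth and sixth entries are the x and y position); the values are made up, and `groupIntoLines` is a hypothetical helper, not the project's actual implementation.

```javascript
// Group positioned text fragments into lines of text.
// Hypothetical helper; item shape imitates pdf.js text items.
function groupIntoLines(items, tolerance = 2) {
  // PDF coordinates start at the bottom-left, so a larger y is higher on the page.
  const byY = [...items].sort((a, b) => b.transform[5] - a.transform[5]);

  const lines = [];
  for (const item of byY) {
    const y = item.transform[5];
    const current = lines[lines.length - 1];
    // Fragments whose baselines are within `tolerance` share a line.
    if (current && Math.abs(current.y - y) <= tolerance) {
      current.items.push(item);
    } else {
      lines.push({ y, items: [item] });
    }
  }

  // Within a line, order fragments left to right and join them.
  return lines.map((line) =>
    line.items
      .sort((a, b) => a.transform[4] - b.transform[4])
      .map((item) => item.str)
      .join(' ')
  );
}

const items = [
  { str: 'world', transform: [12, 0, 0, 12, 110, 700.4] },
  { str: 'Hello', transform: [12, 0, 0, 12, 72, 700] },
  { str: 'Second line', transform: [12, 0, 0, 12, 72, 684] },
];

console.log(groupIntoLines(items));
```

Multi-column pages need more than this single pass, since fragments from different columns can share a baseline.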
