Top Related Projects
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Parsr: Transforms PDF, Documents and Images into Enriched Structured Data
Tabula is a tool for liberating data tables trapped inside PDF files
pdfminer.six: Community maintained fork of pdfminer - we fathom PDF
pdfminer: Python PDF Parser (Not actively maintained). Check out pdfminer.six.
Quick Overview
pdf-to-markdown is a JavaScript tool that parses PDF files and converts them to Markdown. It uses Mozilla's pdf.js as the raw parser and aims to preserve the structure and formatting of the original document while producing clean, readable Markdown. It runs in the browser, and an online version is available at http://pdf2md.morethan.io.
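As a rough illustration of what preserving structure can involve, the sketch below turns text lines into Markdown headings by comparing each line's font height with the most common height in the document. The input shape ({ text, height }) and the function name linesToMarkdown are assumptions made for this example. They are not taken from pdf-to-markdown's code, and the project may use a different heuristic.

```javascript
// Hypothetical input: one object per text line, with its font height.
// This shape is an assumption for illustration, not the project's data model.
function linesToMarkdown(lines) {
  // Count how often each height occurs; the most frequent one is body text.
  const counts = new Map();
  for (const { height } of lines) {
    counts.set(height, (counts.get(height) || 0) + 1);
  }
  let bodyHeight = 0;
  let best = 0;
  for (const [height, count] of counts) {
    if (count > best) {
      best = count;
      bodyHeight = height;
    }
  }

  // Distinct heights larger than body text, largest first: h1, h2, ...
  const headingHeights = [...counts.keys()]
    .filter((h) => h > bodyHeight)
    .sort((a, b) => b - a);

  return lines
    .map(({ text, height }) => {
      const level = headingHeights.indexOf(height) + 1; // 0 means body text
      return level > 0 ? `${'#'.repeat(Math.min(level, 6))} ${text}` : text;
    })
    .join('\n\n');
}

const sample = [
  { text: 'Annual Report', height: 24 },
  { text: 'Overview', height: 16 },
  { text: 'Revenue grew in all regions.', height: 10 },
  { text: 'Costs stayed flat.', height: 10 },
];
console.log(linesToMarkdown(sample));
```

Ranking headings by relative size rather than fixed thresholds keeps the sketch independent of any particular document's font sizes, at the cost of misreading large decorative text as a heading.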
Pros
- Preserves document structure and formatting
- Runs in the browser; an online version is available
- Handles various PDF elements like tables, images, and lists
- Open-source and actively maintained
Cons
- Limited support for complex PDF layouts
- May struggle with PDFs containing non-standard fonts or encodings
- Building from source requires Node.js and npm
- Output quality can vary depending on the input PDF's complexity
Code Examples
The project's README documents only the browser application and its npm build commands, so the command line and the Java class names below (PdfToMarkdownConverter, ConversionOptions) are unconfirmed and should be read as illustrative.
- Converting a PDF file to Markdown using the command-line interface:
java -jar pdf-to-markdown.jar input.pdf output.md
- Using the library in a Java application to convert a PDF file:
import com.github.jzillmann.pdf.PdfToMarkdownConverter;
PdfToMarkdownConverter converter = new PdfToMarkdownConverter();
String markdown = converter.convert("input.pdf");
- Customizing conversion options:
import com.github.jzillmann.pdf.PdfToMarkdownConverter;
import com.github.jzillmann.pdf.ConversionOptions;
ConversionOptions options = new ConversionOptions.Builder()
    .setExtractImages(true)
    .setImageOutputDirectory("images/")
    .build();
PdfToMarkdownConverter converter = new PdfToMarkdownConverter(options);
String markdown = converter.convert("input.pdf");
Getting Started
To try pdf-to-markdown without installing anything, use the online version at http://pdf2md.morethan.io. To build it locally, clone the GitHub repository and run npm install to download the dependencies, then npm run build to build the dev version. Running npm run start serves the app, which is useful for loading worker.js and cmaps; alternatively, open build/index.html in your browser.
Competitor Comparisons
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Pros of OCRmyPDF
- Performs OCR on PDF files, making scanned documents searchable
- Supports multiple languages and can handle complex layouts
- Actively maintained with regular updates and improvements
Cons of OCRmyPDF
- Requires more dependencies and system resources
- May alter the original PDF layout in some cases
- Steeper learning curve for advanced features
Code Comparison
OCRmyPDF:
import ocrmypdf
ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True, language='eng')
pdf-to-markdown:
const pdfToMarkdown = require('pdf-to-markdown');
pdfToMarkdown('input.pdf').then(markdown => console.log(markdown));
Summary
OCRmyPDF is a powerful tool for making scanned PDFs searchable and editable, while pdf-to-markdown focuses on converting PDF content to Markdown format. OCRmyPDF offers more advanced features and language support but requires more setup. pdf-to-markdown is simpler to use but may not handle complex layouts or scanned documents as effectively. The choice between the two depends on the specific use case and whether OCR functionality is needed.
Transforms PDF, Documents and Images into Enriched Structured Data
Pros of Parsr
- More comprehensive document processing capabilities, including table extraction and layout analysis
- Supports multiple output formats (markdown, JSON, text, etc.)
- Actively maintained with regular updates and improvements
Cons of Parsr
- More complex setup and usage, requiring additional dependencies
- Heavier resource consumption due to its extensive features
- Steeper learning curve for basic PDF to markdown conversion
Code Comparison
pdf-to-markdown
const pdfToMarkdown = require('pdf-to-markdown');
pdfToMarkdown('input.pdf')
.then(markdown => console.log(markdown));
Parsr
const Parsr = require('parsr');
const config = { ... };
const parsr = new Parsr(config);
parsr.run('input.pdf')
.then(result => console.log(result.markdown));
While pdf-to-markdown offers a simpler API for basic PDF to markdown conversion, Parsr provides more configuration options and advanced features for document processing. pdf-to-markdown is easier to use for simple tasks, but Parsr offers greater flexibility and functionality for complex document analysis and conversion needs.
Tabula is a tool for liberating data tables trapped inside PDF files
Pros of Tabula
- Specialized in extracting tables from PDFs
- Offers both a GUI and command-line interface
- Supports multiple output formats (CSV, TSV, JSON)
Cons of Tabula
- Limited to table extraction, not full PDF-to-Markdown conversion
- May struggle with complex or poorly formatted PDFs
- Requires Java runtime environment
Code Comparison
pdf-to-markdown
const pdf2md = require('pdf-to-markdown');
pdf2md.convert('input.pdf', 'output.md')
.then(() => console.log('Conversion complete'))
.catch(err => console.error('Error:', err));
Tabula
import technology.tabula.CommandLineApp;
public class TabulaExample {
    public static void main(String[] args) {
        // Extract tables from input.pdf and write them to output.csv
        CommandLineApp.main(new String[]{"input.pdf", "-o", "output.csv"});
    }
}
Summary
pdf-to-markdown focuses on converting entire PDF documents to Markdown format, while Tabula specializes in extracting tables from PDFs. Tabula offers more flexibility in terms of user interface and output formats for table data, but it's limited to table extraction. pdf-to-markdown provides a more comprehensive solution for converting PDFs to Markdown, including text, images, and basic formatting. The choice between the two depends on the specific use case and whether the primary focus is on table extraction or full document conversion.
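To bridge the two tools, table rows extracted elsewhere (for example, parsed from Tabula's CSV output) still have to be rendered as a Markdown table. The sketch below shows one way to do that. It assumes the rows are already available as arrays of strings with the header first; the function name rowsToMarkdownTable is hypothetical and belongs to neither project.

```javascript
// Convert rows of cells (first row = header) into a Markdown table.
// Assumes rows were already parsed into arrays of strings.
function rowsToMarkdownTable(rows) {
  if (rows.length === 0) return '';
  const width = Math.max(...rows.map((r) => r.length));

  // Escape pipes so cell text cannot break the table, and pad short rows.
  const formatRow = (row) => {
    const cells = [];
    for (let i = 0; i < width; i++) {
      cells.push(String(row[i] ?? '').replace(/\|/g, '\\|').trim());
    }
    return `| ${cells.join(' | ')} |`;
  };

  const separator = `| ${Array(width).fill('---').join(' | ')} |`;
  const [header, ...body] = rows;
  return [formatRow(header), separator, ...body.map(formatRow)].join('\n');
}

console.log(
  rowsToMarkdownTable([
    ['Region', 'Revenue'],
    ['EMEA', '1,200'],
    ['APAC'], // short row: padded with an empty cell
  ])
);
```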
Community maintained fork of pdfminer - we fathom PDF
Pros of pdfminer.six
- More comprehensive PDF parsing capabilities, including text extraction, layout analysis, and metadata extraction
- Actively maintained with regular updates and bug fixes
- Supports Python 3; Python 2 support has been dropped
Cons of pdfminer.six
- Steeper learning curve due to its more complex API and extensive features
- May be overkill for simple PDF-to-markdown conversion tasks
- Requires additional processing to convert extracted content to markdown format
Code Comparison
pdfminer.six:
from pdfminer.high_level import extract_text
text = extract_text('sample.pdf')
print(text)
pdf-to-markdown:
from pdf_to_markdown import pdf_to_markdown
markdown = pdf_to_markdown('sample.pdf')
print(markdown)
pdfminer.six provides lower-level access to PDF content, requiring additional processing for markdown conversion. pdf-to-markdown offers a simpler API specifically designed for PDF-to-markdown conversion, but with potentially less flexibility for complex PDF parsing tasks.
Python PDF Parser (Not actively maintained). Check out pdfminer.six.
Pros of pdfminer
- More comprehensive PDF parsing capabilities
- Supports a wider range of PDF features and structures
- Original codebase from which pdfminer.six was forked
Cons of pdfminer
- Steeper learning curve due to its complexity
- Requires more setup and configuration for basic tasks
- May be overkill for simple PDF-to-markdown conversions
Code Comparison
pdfminer:
from pdfminer.high_level import extract_text
text = extract_text('sample.pdf')
print(text)
pdf-to-markdown:
PdfToMarkdownConverter converter = new PdfToMarkdownConverter();
String markdown = converter.convert("sample.pdf");
System.out.println(markdown);
Key Differences
- pdfminer is a Python library, while pdf-to-markdown is JavaScript-based
- pdfminer offers more granular control over PDF parsing
- pdf-to-markdown is specifically designed for markdown conversion, making it simpler for this specific use case
- pdfminer has a larger community and more extensive documentation
Use Cases
- Choose pdfminer for complex PDF processing tasks or when working with Python
- Opt for pdf-to-markdown when specifically needing PDF-to-markdown conversion in the browser or for simpler projects
README
PDF-To-Markdown Converter
JavaScript tool to parse PDF files and convert them into Markdown format. Online version at http://pdf2md.morethan.io!
Major Changes
- 2020/2021 Currently separating the parsing logic from the frontend in order to make it separately available.
- Branch modularize
- Find the current version at https://jzillmann.github.io/pdf-to-markdown-staging/
- Help me
- Apr 2017 - 0.1: Initial Release
Contribute
Use the issue tracker and/or open pull requests!
Useful Build Commands
- npm install: Download all necessary npm packages
- npm run lint: Lint the JavaScript files
- npm run test: Run tests
- npm run check: Lint & Test
- npm run build: Build the dev version
- npm run start: Run the app on a server (useful for loading of worker.js and cmaps)
- npm run watch: Continuously build the project
- open build/index.html: Open the built project in your default browser
- npm run release: Build production version
- npm run deploy: Build production version & move it to the GitHub Pages folder
Release
- Increase version in package.json
- npm run deploy
- commit & push
- tag with
- git tag -a $releaseVersion -m "$releaseVersion release"
- git push --tags
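The version bump and tagging steps above can be sketched as two small helpers. This assumes the version in package.json is a plain x.y.z string; bumpVersion and tagCommands are hypothetical names, not scripts that ship with the project, and the helpers only build strings rather than running git.

```javascript
// Compute the next version from a plain "x.y.z" string.
function bumpVersion(version, part = 'patch') {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!match) throw new Error(`Not a plain x.y.z version: ${version}`);
  let [major, minor, patch] = match.slice(1).map(Number);
  if (part === 'major') {
    major += 1;
    minor = 0;
    patch = 0;
  } else if (part === 'minor') {
    minor += 1;
    patch = 0;
  } else if (part === 'patch') {
    patch += 1;
  } else {
    throw new Error(`Unknown version part: ${part}`);
  }
  return `${major}.${minor}.${patch}`;
}

// Build the tagging commands from the release steps for a given version.
function tagCommands(releaseVersion) {
  return [
    `git tag -a ${releaseVersion} -m "${releaseVersion} release"`,
    'git push --tags',
  ];
}

const next = bumpVersion('0.1.0', 'minor');
console.log(next);
console.log(tagCommands(next).join('\n'));
```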
Credits
pdf.js - Mozilla's PDF parsing & rendering platform which is used as a raw parser