jzillmann / pdf-to-markdown

A PDF to Markdown converter

Top Related Projects

  • OCRmyPDF (14,361): OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
  • Parsr (5,872): Transforms PDF, Documents and Images into Enriched Structured Data
  • Tabula (6,831): Tabula is a tool for liberating data tables trapped inside PDF files
  • pdfminer.six: Community maintained fork of pdfminer - we fathom PDF
  • pdfminer: Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Quick Overview

pdf-to-markdown is a JavaScript tool that parses PDF files and converts them to Markdown. It uses Mozilla's pdf.js as the raw parser and aims to preserve the structure and formatting of the original document while producing clean, readable Markdown. An online version is available at http://pdf2md.morethan.io, and the project can also be built and run locally with npm.

Pros

  • Preserves document structure and formatting
  • Runs in the browser, with an online version at http://pdf2md.morethan.io
  • Handles various PDF elements like tables, images, and lists
  • Open-source and actively maintained

Cons

  • Limited support for complex PDF layouts
  • May struggle with PDFs containing non-standard fonts or encodings
  • Building it locally requires Node.js and npm
  • Output quality can vary depending on the input PDF's complexity

Code Examples

  1. Rebuilding the project continuously while working on it:
npm run watch
  2. Serving the app from a local server (useful for loading worker.js and the cmaps):
npm run start
  3. Linting the JavaScript files and running the tests in one step:
npm run check

Getting Started

The quickest way to try pdf-to-markdown is the online version at http://pdf2md.morethan.io, which needs no installation. To build it yourself, clone the repository from GitHub, then install the dependencies and build the development version:

npm install
npm run build

Then open build/index.html in your browser to use the built project.

Competitor Comparisons

14,361

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Pros of OCRmyPDF

  • Performs OCR on PDF files, making scanned documents searchable
  • Supports multiple languages and can handle complex layouts
  • Actively maintained with regular updates and improvements

Cons of OCRmyPDF

  • Requires more dependencies and system resources
  • May alter the original PDF layout in some cases
  • Steeper learning curve for advanced features

Code Comparison

OCRmyPDF:

import ocrmypdf

ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True, language='eng')

pdf-to-markdown (hypothetical module and call, shown for illustration only):

const pdfToMarkdown = require('pdf-to-markdown');

pdfToMarkdown('input.pdf').then(markdown => console.log(markdown));

Summary

OCRmyPDF is a powerful tool for making scanned PDFs searchable and editable, while pdf-to-markdown focuses on converting PDF content to Markdown format. OCRmyPDF offers more advanced features and language support but requires more setup. pdf-to-markdown is simpler to use but may not handle complex layouts or scanned documents as effectively. The choice between the two depends on the specific use case and whether OCR functionality is needed.

5,872

Transforms PDF, Documents and Images into Enriched Structured Data

Pros of Parsr

  • More comprehensive document processing capabilities, including table extraction and layout analysis
  • Supports multiple output formats (markdown, JSON, text, etc.)
  • Actively maintained with regular updates and improvements

Cons of Parsr

  • More complex setup and usage, requiring additional dependencies
  • Heavier resource consumption due to its extensive features
  • Steeper learning curve for basic PDF to markdown conversion

Code Comparison

pdf-to-markdown (hypothetical module and call, shown for illustration only)

const pdfToMarkdown = require('pdf-to-markdown');

pdfToMarkdown('input.pdf')
  .then(markdown => console.log(markdown));

Parsr (hypothetical client API, shown for illustration only)

const Parsr = require('parsr');

const config = { ... };
const parsr = new Parsr(config);
parsr.run('input.pdf')
  .then(result => console.log(result.markdown));

While pdf-to-markdown offers a simpler API for basic PDF to markdown conversion, Parsr provides more configuration options and advanced features for document processing. pdf-to-markdown is easier to use for simple tasks, but Parsr offers greater flexibility and functionality for complex document analysis and conversion needs.

6,831

Tabula is a tool for liberating data tables trapped inside PDF files

Pros of Tabula

  • Specialized in extracting tables from PDFs
  • Offers both a GUI and command-line interface
  • Supports multiple output formats (CSV, TSV, JSON)

Cons of Tabula

  • Limited to table extraction, not full PDF-to-Markdown conversion
  • May struggle with complex or poorly formatted PDFs
  • Requires Java runtime environment

Code Comparison

pdf-to-markdown (hypothetical module and call, shown for illustration only)

const pdf2md = require('pdf-to-markdown');

pdf2md.convert('input.pdf', 'output.md')
  .then(() => console.log('Conversion complete'))
  .catch(err => console.error('Error:', err));

Tabula

import technology.tabula.CommandLineApp;

public class TabulaExample {
    public static void main(String[] args) {
        CommandLineApp.main(new String[]{"input.pdf", "-o", "output.csv"});
    }
}

Summary

pdf-to-markdown focuses on converting entire PDF documents to Markdown format, while Tabula specializes in extracting tables from PDFs. Tabula offers more flexibility in terms of user interface and output formats for table data, but it's limited to table extraction. pdf-to-markdown provides a more comprehensive solution for converting PDFs to Markdown, including text, images, and basic formatting. The choice between the two depends on the specific use case and whether the primary focus is on table extraction or full document conversion.

Community maintained fork of pdfminer - we fathom PDF

Pros of pdfminer.six

  • More comprehensive PDF parsing capabilities, including text extraction, layout analysis, and metadata extraction
  • Actively maintained with regular updates and bug fixes
  • Supports Python 3 (Python 2 support has been dropped)

Cons of pdfminer.six

  • Steeper learning curve due to its more complex API and extensive features
  • May be overkill for simple PDF-to-markdown conversion tasks
  • Requires additional processing to convert extracted content to markdown format

Code Comparison

pdfminer.six:

from pdfminer.high_level import extract_text

text = extract_text('sample.pdf')
print(text)

pdf-to-markdown (hypothetical module and call, shown for illustration only):

const pdfToMarkdown = require('pdf-to-markdown');

pdfToMarkdown('sample.pdf').then(markdown => console.log(markdown));

pdfminer.six provides lower-level access to PDF content, requiring additional processing for markdown conversion. pdf-to-markdown offers a simpler API specifically designed for PDF-to-markdown conversion, but with potentially less flexibility for complex PDF parsing tasks.

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Pros of pdfminer

  • More comprehensive PDF parsing capabilities
  • Supports a wider range of PDF features and structures
  • The original codebase from which the community-maintained pdfminer.six was forked

Cons of pdfminer

  • Steeper learning curve due to its complexity
  • Requires more setup and configuration for basic tasks
  • May be overkill for simple PDF-to-markdown conversions

Code Comparison

pdfminer (shown with the extract_text helper from the pdfminer.six fork):

from pdfminer.high_level import extract_text

text = extract_text('sample.pdf')
print(text)

pdf-to-markdown (hypothetical module and call, shown for illustration only):

const pdfToMarkdown = require('pdf-to-markdown');

pdfToMarkdown('sample.pdf').then(markdown => console.log(markdown));

Key Differences

  • pdfminer is a Python library, while pdf-to-markdown is a JavaScript tool
  • pdfminer offers more granular control over PDF parsing
  • pdf-to-markdown is specifically designed for markdown conversion, making it simpler for this specific use case
  • pdfminer has a larger community and more extensive documentation

Use Cases

  • Choose pdfminer for complex PDF processing tasks or when working with Python
  • Opt for pdf-to-markdown when specifically needing PDF-to-markdown conversion in the browser or for simpler projects

README

PDF-To-Markdown Converter

JavaScript tool to parse PDF files and convert them into Markdown format. Online version at http://pdf2md.morethan.io

Major Changes

Contribute

Use the issue tracker and/or open pull requests!

Useful Build Commands

  • npm install: Download all necessary npm packages
  • npm run lint: Lint the JavaScript files
  • npm run test: Run tests
  • npm run check: Lint & Test
  • npm run build: Build the dev version
  • npm run start: Run the app on a server (useful for loading of worker.js and cmaps)
  • npm run watch: Continuously build the project
  • open build/index.html: Open the built project in your default browser
  • npm run release: Build production version
  • npm run deploy: Build production version & move it to the GitHub Pages folder

Release

  • Increase version in package.json
  • npm run deploy
  • commit & push
  • tag with
    • git tag -a $releaseVersion -m "$releaseVersion release"
    • git push --tags
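
The release checklist above could be sketched as a small shell helper. This is an illustration under stated assumptions, not the project's own tooling: release_plan is a hypothetical function name, the commit message is a placeholder rather than the project's convention, and the helper only prints the commands so that the version bump in package.json can still be done by hand first.

```shell
#!/bin/sh
# Hypothetical helper: print the release commands for a given version
# without running them. Assumes package.json has already been bumped.
release_plan() {
    releaseVersion="$1"
    if [ -z "$releaseVersion" ]; then
        echo "usage: release_plan <version>" >&2
        return 1
    fi
    # Build the production version and move it to the GitHub Pages folder.
    printf '%s\n' "npm run deploy"
    # Commit message is a placeholder, not the project's convention.
    printf '%s\n' "git commit -am \"Release $releaseVersion\""
    printf '%s\n' "git push"
    # Tag commands exactly as given in the checklist.
    printf '%s\n' "git tag -a $releaseVersion -m \"$releaseVersion release\""
    printf '%s\n' "git push --tags"
}

# Example version only; substitute the version from package.json.
release_plan "1.0.0"
```

Once the printed commands look right, they can be run one at a time or piped to a shell.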

Credits

pdf.js - Mozilla's PDF parsing & rendering platform which is used as a raw parser