VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy

Top Related Projects

  • Llama: Inference code for Llama models
  • Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
  • whisper.cpp: Port of OpenAI's Whisper model in C/C++
  • 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX
  • nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs
  • DeepSpeed: A deep learning optimization library that makes distributed training and inference easy, efficient, and effective

Quick Overview

Marker is an open-source Python library for converting PDF files to markdown. It uses a pipeline of deep learning models to accurately extract and format text, tables, and images from documents, making content from PDFs easier to work with in markdown form.

Pros

  • High accuracy in text and layout extraction
  • Supports a wide range of documents, optimized for books and scientific papers
  • Preserves formatting, including tables and images
  • Easy to use with a simple Python API

Cons

  • Requires significant computational resources for processing
  • May struggle with highly complex or non-standard document layouts
  • Limited support for handwritten text or unusual fonts
  • Dependency on external libraries and models

Code Examples

The snippets below illustrate marker's Python API as of marker-pdf 0.x; check the project README if the interface has changed.

  1. Basic usage to convert a PDF to markdown:
from marker.convert import convert_single_pdf
from marker.models import load_all_models

model_lst = load_all_models()
full_text, images, out_meta = convert_single_pdf("input.pdf", model_lst)
print(full_text)
  2. Reusing loaded models across several documents (model loading is the slow step):
from marker.convert import convert_single_pdf
from marker.models import load_all_models

model_lst = load_all_models()
for path in ["first.pdf", "second.pdf"]:
    full_text, images, out_meta = convert_single_pdf(path, model_lst)
  3. Customizing conversion options:
from marker.convert import convert_single_pdf
from marker.models import load_all_models

model_lst = load_all_models()
full_text, images, out_meta = convert_single_pdf(
    "input.pdf",
    model_lst,
    max_pages=10,
    langs=["English"],
)
print(full_text)

Getting Started

To get started with Marker, follow these steps:

  1. Install Marker using pip:

    pip install marker-pdf
    
  2. Import and use marker in your Python script:

    from marker.convert import convert_single_pdf
    from marker.models import load_all_models
    
    model_lst = load_all_models()
    full_text, images, out_meta = convert_single_pdf("your_document.pdf", model_lst)
    
    # Save the markdown to a file
    with open("output.md", "w") as f:
        f.write(full_text)
    

This will convert your PDF to markdown and save it as "output.md" in the current directory.

Competitor Comparisons

Llama: Inference code for Llama models

Pros of Llama

  • Developed by Meta AI, benefiting from extensive resources and research
  • Supports multiple languages and tasks beyond text generation
  • Offers various model sizes for different computational requirements

Cons of Llama

  • Requires more computational resources to run effectively
  • Less focused on specific document processing tasks
  • May have stricter licensing and usage restrictions

Code Comparison

Marker:

from marker.convert import convert_single_pdf
from marker.models import load_all_models

model_lst = load_all_models()
full_text, images, out_meta = convert_single_pdf("document.pdf", model_lst)
print(full_text)

Llama:

from llama import Llama

llm = Llama(model_path="path/to/model.bin")
output = llm.generate("Your prompt here", max_tokens=100)
print(output)

Key Differences

Marker is specifically designed for converting PDFs and similar documents to markdown, while Llama is a general-purpose language model. Marker focuses on extracting and structuring text from documents, whereas Llama can be used for a wide range of natural language processing tasks.

Marker is likely easier to set up and use for document-specific tasks, while Llama offers more flexibility but may require more expertise to implement effectively. The choice between the two depends on the specific use case and available resources.

Whisper: Robust Speech Recognition via Large-Scale Weak Supervision

Pros of Whisper

  • More extensive language support (80+ languages)
  • Highly accurate transcription, especially for English
  • Robust to background noise and accents

Cons of Whisper

  • Larger model size, requiring more computational resources
  • Slower processing speed, especially for longer audio files
  • Less flexible for fine-tuning on specific domains or accents

Code Comparison

Whisper:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

Marker:

from marker.convert import convert_single_pdf
from marker.models import load_all_models

model_lst = load_all_models()
full_text, images, out_meta = convert_single_pdf("document.pdf", model_lst)
print(full_text)

Key Differences

  • Whisper is a speech-to-text model, while Marker converts PDFs and similar documents to markdown
  • The two tools address different modalities: audio transcription versus document text extraction
  • Whisper has extensive research backing and is widely adopted in industry
  • Marker is a focused pipeline of smaller models, designed to run quickly on a single GPU, CPU, or MPS

Both projects turn hard-to-process inputs into plain text, with Whisper handling audio and Marker handling documents. The choice between them depends on the input modality rather than on overlapping features.

Port of OpenAI's Whisper model in C/C++

Pros of whisper.cpp

  • Highly optimized C++ implementation, offering faster performance
  • Supports various platforms and architectures, including mobile devices
  • Provides real-time audio processing capabilities

Cons of whisper.cpp

  • Focused solely on the Whisper speech model, while marker orchestrates several models for document conversion
  • Requires more manual setup and configuration compared to marker's user-friendly interface
  • Less flexibility in terms of customization and fine-tuning options

Code Comparison

whisper.cpp:

// Initialize whisper context
struct whisper_context * ctx = whisper_init_from_file("ggml-base.en.bin");

// Process audio
whisper_full_default(ctx, wparams, pcmf32.data(), pcmf32.size());

// Print result
const int n_segments = whisper_full_n_segments(ctx);
for (int i = 0; i < n_segments; ++i) {
    const char * text = whisper_full_get_segment_text(ctx, i);
    printf("%s", text);
}

marker:

from marker.convert import convert_single_pdf
from marker.models import load_all_models

# Load models and convert a document
model_lst = load_all_models()
full_text, images, out_meta = convert_single_pdf("document.pdf", model_lst)

# Print result
print(full_text)

The code comparison shows the simplicity of marker's Python interface against the lower-level C++ implementation of whisper.cpp. whisper.cpp offers fine-grained control and strong performance for audio transcription, while marker provides a Pythonic pipeline for document conversion; the two solve different problems.

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Extensive library with support for numerous pre-trained models and architectures
  • Well-documented and actively maintained by a large community
  • Seamless integration with other Hugging Face tools and datasets

Cons of transformers

  • Steeper learning curve due to its comprehensive nature
  • Can be resource-intensive for smaller projects or limited hardware
  • May include unnecessary features for specific use cases

Code comparison

transformers:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")[0]
print(f"Label: {result['label']}, Score: {result['score']:.4f}")

marker:

from marker.convert import convert_single_pdf
from marker.models import load_all_models

model_lst = load_all_models()
full_text, images, out_meta = convert_single_pdf("document.pdf", model_lst)
print(full_text)

Key differences

  • transformers focuses on NLP tasks and model implementations
  • marker specializes in document processing and OCR
  • transformers offers a wider range of pre-trained models and tasks
  • marker provides specific tools for PDF conversion and image processing

nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs

Pros of nanoGPT

  • Simpler implementation, focusing on core GPT architecture
  • Excellent educational resource for understanding transformer models
  • Highly optimized for performance on single GPU setups

Cons of nanoGPT

  • Limited features compared to Marker's more comprehensive toolkit
  • Less focus on practical applications and fine-tuning for specific tasks
  • Requires more expertise to adapt for real-world use cases

Code Comparison

nanoGPT:

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

Marker:

from marker.convert import convert_single_pdf
from marker.models import load_all_models

model_lst = load_all_models()
full_text, images, out_meta = convert_single_pdf("document.pdf", model_lst)

nanoGPT exposes low-level model internals like the attention head above, while marker keeps its models behind a high-level document-conversion API.

DeepSpeed: A deep learning optimization library that makes distributed training and inference easy, efficient, and effective

Pros of DeepSpeed

  • Highly optimized for large-scale distributed training of deep learning models
  • Supports a wide range of model architectures and training scenarios
  • Integrates seamlessly with popular frameworks like PyTorch and Hugging Face

Cons of DeepSpeed

  • Steeper learning curve due to its complexity and advanced features
  • Primarily focused on training, with less emphasis on inference optimization
  • Requires more setup and configuration for optimal performance

Code Comparison

DeepSpeed:

import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(args=args,
                                                     model=model,
                                                     model_parameters=params)

Marker:

from marker.convert import convert_single_pdf
from marker.models import load_all_models

model_lst = load_all_models()
full_text, images, out_meta = convert_single_pdf("document.pdf", model_lst)

Key Differences

  • DeepSpeed is a comprehensive deep learning optimization library, while Marker is specifically designed for converting PDFs to markdown
  • DeepSpeed focuses on distributed training and model parallelism, whereas Marker emphasizes ease of use for text processing tasks
  • DeepSpeed requires more setup and configuration, while Marker offers a simpler API for quick implementation


README

Marker

Marker converts PDF to markdown quickly and accurately.

  • Supports a wide range of documents (optimized for books and scientific papers)
  • Supports all languages
  • Removes headers/footers/other artifacts
  • Formats tables and code blocks
  • Extracts and saves images along with the markdown
  • Converts most equations to latex
  • Works on GPU, CPU, or MPS

How it works

Marker is a pipeline of deep learning models:

  • Extract text, OCR if necessary (heuristics, surya, tesseract)
  • Detect page layout and find reading order (surya)
  • Clean and format each block (heuristics, texify)
  • Combine blocks and postprocess complete text (heuristics, pdf_postprocessor)

It only uses models where necessary, which improves speed and accuracy.
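
The four stages above can be sketched as a simple function pipeline. This is only an illustrative outline with trivial placeholder functions, not marker's actual code:

```python
# Illustrative sketch of the four pipeline stages described above.
# These placeholders stand in for the real models (surya, texify, etc.).

def extract_blocks(raw_text: str) -> list[str]:
    # Stage 1: extract text blocks (marker OCRs only when necessary)
    return [b for b in raw_text.split("\n\n") if b.strip()]

def order_blocks(blocks: list[str]) -> list[str]:
    # Stage 2: layout detection and reading order (surya in marker)
    return blocks  # placeholder: assume input order is already reading order

def clean_block(block: str) -> str:
    # Stage 3: per-block cleanup and formatting (heuristics, texify in marker)
    return " ".join(block.split())

def postprocess(blocks: list[str]) -> str:
    # Stage 4: combine blocks into the final markdown string
    return "\n\n".join(blocks)

raw = "Title\n\n\nBody   text here.\n\n"
markdown = postprocess([clean_block(b) for b in order_blocks(extract_blocks(raw))])
print(markdown)
```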

Examples

PDF | Type | Marker | Nougat
Think Python | Textbook | View | View
Think OS | Textbook | View | View
Switch Transformers | arXiv paper | View | View
Multi-column CNN | arXiv paper | View | View

Performance

[Figure: overall benchmark scores]

The above results are with marker and nougat setup so they each take ~4GB of VRAM on an A6000.

See below for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

Commercial usage

I want marker to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.

The weights for the models are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options here.

Hosted API

There's a hosted API for marker available here:

  • Supports PDFs, Word documents, and PowerPoints
  • 1/4th the price of leading cloud-based competitors
  • Leverages Modal for high reliability without latency spikes

Community

Discord is where we discuss future development.

Limitations

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

  • Marker will not convert 100% of equations to LaTeX, because it has to first detect and then convert each equation, and either step can miss.
  • Tables are not always formatted 100% correctly - text can be in the wrong column.
  • Whitespace and indentations are not always respected.
  • Not all lines/spans will be joined properly.
  • This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.

Installation

You'll need python 3.9+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See here for more details.

Install with:

pip install marker-pdf

Optional: OCRMyPDF

Only needed if you want to use the optional ocrmypdf as the ocr backend. Note that ocrmypdf includes Ghostscript, an AGPL dependency, but calls it via CLI, so it does not trigger the license provisions.

See the instructions here

Usage

First, some configuration:

  • Inspect the settings in marker/settings.py. You can override any settings with environment variables.
  • Your torch device will be automatically detected, but you can override this. For example, TORCH_DEVICE=cuda.
  • By default, marker will use surya for OCR. Surya is slower on CPU, but more accurate than tesseract. It also doesn't require you to specify the languages in the document. If you want faster OCR, set OCR_ENGINE to ocrmypdf. This also requires external dependencies (see above). If you don't want OCR at all, set OCR_ENGINE to None.
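
If you prefer to configure from Python, the same settings can be set as environment variables before marker is imported (variable names as described above; marker reads them when its settings load):

```python
import os

# Override marker settings via environment variables; these must be set
# before marker is imported. Names are from the settings described above.
os.environ["TORCH_DEVICE"] = "cuda"    # force inference onto a specific device
os.environ["OCR_ENGINE"] = "ocrmypdf"  # faster OCR backend (needs ocrmypdf installed)

print(os.environ["TORCH_DEVICE"], os.environ["OCR_ENGINE"])
```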

Interactive App

I've included a streamlit app that lets you interactively try marker with some basic options. Run it with:

pip install streamlit
marker_gui

Convert a single file

marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10 
  • --batch_multiplier is how much to multiply default batch sizes by if you have extra VRAM. Higher numbers will take more VRAM, but process faster. Set to 2 by default. The default batch sizes will take ~3GB of VRAM.
  • --max_pages is the maximum number of pages to process. Omit this to convert the entire document.
  • --langs is an optional comma separated list of the languages in the document, for OCR. Optional by default, required if you use tesseract.
  • --ocr_all_pages is an optional argument to force OCR on all pages of the PDF. If this or the env var OCR_ALL_PAGES are true, OCR will be forced.

The list of supported languages for surya OCR is here. If you need more languages, you can use any language supported by Tesseract if you set OCR_ENGINE to ocrmypdf. If you don't need OCR, marker can work with any language.
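
For scripting, the marker_single command above can also be assembled and launched from Python. A minimal sketch, with placeholder file paths:

```python
import shlex

# Build the marker_single invocation described above (file paths are
# placeholders). Run it with subprocess.run(cmd, check=True) once marker
# is installed.
cmd = [
    "marker_single", "input.pdf", "output_folder",
    "--batch_multiplier", "2",
    "--max_pages", "10",
    "--langs", "English",
]
print(shlex.join(cmd))
```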

Convert multiple files

marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10 --min_length 10000
  • --workers is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Marker will use 5GB of VRAM per worker at the peak, and 3.5GB average.
  • --max is the maximum number of pdfs to convert. Omit this to convert all pdfs in the folder.
  • --min_length is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images. (slows everything down)
  • --metadata_file is an optional path to a json file with metadata about the pdfs. If you provide it, it will be used to set the language for each pdf. Setting language is optional for surya (default), but required for tesseract. The format is:
{
  "pdf1.pdf": {"languages": ["English"]},
  "pdf2.pdf": {"languages": ["Spanish", "Russian"]},
  ...
}

You can use language names or codes. The exact codes depend on the OCR engine. See here for a full list for surya codes, and here for tesseract.
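
A metadata file in this format can be generated with a short script (the pdf file names below are hypothetical examples):

```python
import json

# Write a metadata file in the format shown above.
metadata = {
    "pdf1.pdf": {"languages": ["English"]},
    "pdf2.pdf": {"languages": ["Spanish", "Russian"]},
}

with open("pdf_meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```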

Convert multiple files on multiple GPUs

MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
  • METADATA_FILE is an optional path to a json file with metadata about the pdfs. See above for the format.
  • NUM_DEVICES is the number of GPUs to use. Should be 2 or greater.
  • NUM_WORKERS is the number of parallel processes to run on each GPU.
  • MIN_LENGTH is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images. (slows everything down)

Note that the env variables above are specific to this script, and cannot be set in local.env.

Troubleshooting

There are some settings that you may find useful if things aren't working the way you expect:

  • OCR_ALL_PAGES - set this to true to force OCR all pages. This can be very useful if the table layouts aren't recognized properly by default, or if there is garbled text.
  • TORCH_DEVICE - set this to force marker to use a given torch device for inference.
  • OCR_ENGINE - can set this to surya or ocrmypdf.
  • DEBUG - setting this to True shows ray logs when converting multiple pdfs
  • Verify that you set the languages correctly, or passed in a metadata file.
  • If you're getting out of memory errors, decrease the worker count (or increase the VRAM_PER_TASK setting). You can also try splitting up long PDFs into multiple files.

In general, if output is not what you expect, trying to OCR the PDF is a good first step. Not all PDFs have good text/bboxes embedded in them.

Useful settings

These settings can improve/change output quality:

  • OCR_ALL_PAGES will force OCR across the document. Many PDFs have bad text embedded due to older OCR engines being used.
  • PAGINATE_OUTPUT will put a horizontal rule between pages. Default: False.
  • EXTRACT_IMAGES will extract images and save separately. Default: True.
  • BAD_SPAN_TYPES specifies layout blocks to remove from the markdown output.

Benchmarks

Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods. It's noisy, but at least directionally correct.
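
As a rough illustration of this kind of scoring (not the benchmark's exact metric), a reference text and an extraction can be compared with a sequence-alignment ratio:

```python
from difflib import SequenceMatcher

def alignment_score(reference: str, hypothesis: str) -> float:
    """Return a 0-1 similarity between reference text and extracted text."""
    return SequenceMatcher(None, reference, hypothesis).ratio()

reference = "marker converts pdf to markdown"
extracted = "marker converts pdf to markdwn"  # one OCR-style error
print(round(alignment_score(reference, extracted), 3))
```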

Benchmarks show that marker is 4x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the pdf with no processing) for comparison.

Speed

Method | Average Score | Time per page | Time per document
marker | 0.613721 | 0.631991 | 58.1432
nougat | 0.406603 | 2.59702 | 238.926

Accuracy

First 3 are non-arXiv books, last 3 are arXiv papers.

Method | multicolcnn.pdf | switch_trans.pdf | thinkpython.pdf | thinkos.pdf | thinkdsp.pdf | crowd.pdf
marker | 0.536176 | 0.516833 | 0.70515 | 0.710657 | 0.690042 | 0.523467
nougat | 0.44009 | 0.588973 | 0.322706 | 0.401342 | 0.160842 | 0.525663

Peak GPU memory usage during the benchmark is 4.2GB for nougat, and 4.1GB for marker. Benchmarks were run on an A6000 Ada.

Throughput

Marker takes about 4GB of VRAM on average per task, so you can convert 12 documents in parallel on an A6000.
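
That figure is simple VRAM arithmetic, assuming the A6000's 48GB:

```python
# The throughput claim above as arithmetic: an A6000 has 48 GB of VRAM
# and marker averages about 4 GB per task.
a6000_vram_gb = 48
vram_per_task_gb = 4
parallel_docs = a6000_vram_gb // vram_per_task_gb
print(parallel_docs)  # 12
```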

[Figure: detailed benchmark results]

Running your own benchmarks

You can benchmark the performance of marker on your machine. Install marker manually with:

git clone https://github.com/VikParuchuri/marker.git
poetry install

Download the benchmark data here and unzip. Then run the overall benchmark like this:

python benchmark/overall.py data/pdfs data/references report.json --nougat

This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each.

Omit --nougat to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.

Table benchmark

There is a benchmark for table parsing, which you can run with:

python benchmarks/table.py test_data/tables.json

Thanks

This work would not have been possible without amazing open source models and datasets, including (but not limited to):

  • Surya
  • Texify
  • Pypdfium2/pdfium
  • DocLayNet from IBM
  • ByT5 from Google

Thank you to the authors of these models and datasets for making them available to the community!