pdf2htmlEX

Convert PDF to HTML without losing text or format.

Quick Overview

pdf2htmlEX is an open-source tool that converts PDF files to HTML, preserving the original layout and formatting as much as possible. It aims to produce high-quality HTML output that closely resembles the original PDF, making it useful for web publishing and archiving purposes.

Pros

  • Maintains the original layout and formatting of PDF documents
  • Supports complex PDF features like fonts, images, and vector graphics
  • Produces output that opens in desktop and mobile browsers, although the layout is fixed to match the PDF rather than reflowed
  • Offers various customization options for fine-tuning the conversion process

Cons

  • May struggle with highly complex or non-standard PDF files
  • Large PDF files can result in sizeable HTML output, potentially impacting page load times
  • Requires some technical knowledge to set up and use effectively
  • Development and updates have been somewhat inconsistent in recent years

Getting Started

To use pdf2htmlEX, follow these steps:

  1. Install pdf2htmlEX on your system (instructions vary by OS)
  2. Open a terminal or command prompt
  3. Navigate to the directory containing your PDF file
  4. Run the following command:
pdf2htmlEX input.pdf

This will generate an HTML file named input.html in the same directory. For more advanced usage and options, consult the pdf2htmlEX documentation.
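The steps above can also be scripted. The following is a minimal sketch, assuming pdf2htmlEX is installed and on the PATH; the helper names `build_command`, `default_output_name`, and `convert` are hypothetical and are not part of pdf2htmlEX itself.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(pdf_path: str, output_name: Optional[str] = None) -> List[str]:
    """Build the argument list for `pdf2htmlEX input.pdf [output.html]`."""
    cmd = ["pdf2htmlEX", pdf_path]
    if output_name is not None:
        cmd.append(output_name)
    return cmd


def default_output_name(pdf_path: str) -> str:
    """HTML file name used when no output name is given (input.pdf -> input.html)."""
    return Path(pdf_path).with_suffix(".html").name


def convert(pdf_path: str, output_name: Optional[str] = None) -> str:
    """Run the conversion and return the name of the HTML file to look for."""
    if shutil.which("pdf2htmlEX") is None:
        raise FileNotFoundError("pdf2htmlEX is not installed or not on PATH")
    # check=True raises CalledProcessError if the converter exits with an error
    subprocess.run(build_command(pdf_path, output_name), check=True)
    return output_name or default_output_name(pdf_path)
```

Calling `convert("input.pdf")` from the directory that holds the PDF mirrors step 4; keeping the command construction in its own function makes it easy to inspect the arguments before anything is executed.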

Competitor Comparisons

Convert PDF to HTML without losing text or format.

Pros of coolwanglu/pdf2htmlEX (upstream)

  • Original project by the creator, potentially more authentic implementation
  • May have more historical context and documentation
  • Could be more stable due to longer development history

Cons of coolwanglu/pdf2htmlEX (upstream)

  • Less active development and maintenance
  • Fewer recent updates and bug fixes
  • Potentially outdated dependencies and compatibility issues

Code Comparison

While both repositories contain similar core functionality, there might be slight differences in implementation or recent updates. Here's a hypothetical example of how a function might differ:

coolwanglu/pdf2htmlEX (upstream):

void process_pdf(const char* input_file) {
    // Older implementation
    // ...
}

pdf2htmlEX (this fork):

void process_pdf(const char* input_file, const char* output_file) {
    // Updated implementation with additional parameter
    // ...
}

Note that this is a simplified example and may not reflect actual differences between the repositories. The main distinctions are likely to be in recent updates, bug fixes, and potentially some feature additions or optimizations in the more actively maintained fork.

PDF Reader in JavaScript

Pros of pdf.js

  • Pure JavaScript implementation, runs directly in web browsers
  • Extensive browser compatibility and integration with web technologies
  • Active development and maintenance by Mozilla

Cons of pdf.js

  • Slower rendering performance for complex PDFs
  • Limited support for advanced PDF features and interactivity

Code Comparison

pdf.js:

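pdfjsLib.getDocument('document.pdf').promise.then(function (pdf) {
  return pdf.getPage(1);
}).then(function (page) {
  var viewport = page.getViewport({ scale: 1.5 });
  // Assumes the host page contains <canvas id="pdf-canvas">
  var canvas = document.getElementById('pdf-canvas');
  canvas.width = viewport.width;
  canvas.height = viewport.height;
  // Draw the page onto the canvas
  return page.render({
    canvasContext: canvas.getContext('2d'),
    viewport: viewport
  }).promise;
});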

pdf2htmlEX:

pdf2htmlEX input.pdf output.html

Key Differences

  • pdf2htmlEX converts PDFs to HTML/CSS, while pdf.js renders PDFs directly in the browser
  • pdf2htmlEX offers better text selection and search capabilities in the output
  • pdf.js provides a more interactive viewing experience with zooming and page navigation

Use Cases

  • pdf.js: Web-based PDF viewers, browser extensions
  • pdf2htmlEX: Static HTML conversion, SEO-friendly PDF content

Community and Support

  • pdf.js: Large community, frequent updates, extensive documentation
  • pdf2htmlEX: Smaller community, less frequent updates, but specialized use cases

macOS Quick Look extension for Markdown files.

Pros of QLMarkdown

  • Specifically designed for macOS Quick Look, providing seamless Markdown preview integration
  • Supports a wide range of Markdown flavors and extensions
  • Actively maintained with regular updates and improvements

Cons of QLMarkdown

  • Limited to macOS platform, not cross-platform like pdf2htmlEX
  • Focused solely on Markdown, while pdf2htmlEX handles PDF conversion to HTML

Code Comparison

QLMarkdown (Objective-C):

- (NSData *)dataOfType:(NSString *)typeName error:(NSError **)outError {
    [NSException raise:@"UnimplementedMethod" format:@"%@ is unimplemented", NSStringFromSelector(_cmd)];
    return nil;
}

pdf2htmlEX (C++):

void HTMLRenderer::process(PDFDoc *doc) {
    xref = doc->getXRef();
    catalog = doc->getCatalog();
    if(!catalog) throw "Cannot read the catalog";
    // ...
}

Note: The code snippets are not directly comparable as they serve different purposes. QLMarkdown focuses on macOS Quick Look integration, while pdf2htmlEX handles PDF to HTML conversion.

An HTML to PDF library for the JVM. Based on Flying Saucer and Apache PDF-BOX 2. With SVG image support. Now also with accessible PDF support (WCAG, Section 508, PDF/UA)!

Pros of openhtmltopdf

  • Java-based, making it easier to integrate with Java applications
  • Supports CSS 2.1 and a subset of CSS3, including paged-media features useful for print (flexbox and grid layouts are not supported)
  • Actively maintained with regular updates and improvements

Cons of openhtmltopdf

  • Limited support for complex PDF structures and interactive elements
  • May have slower performance compared to native C++ implementations
  • Requires Java runtime environment, which can be a drawback in some scenarios

Code Comparison

openhtmltopdf:

PdfRendererBuilder builder = new PdfRendererBuilder();
builder.withUri("input.html");
builder.toStream(outputStream);
builder.run();

pdf2htmlEX:

pdf2htmlEX input.pdf output.html

The code comparison shows that openhtmltopdf is used programmatically within Java applications, while pdf2htmlEX is typically used as a command-line tool. This reflects their different approaches and use cases.

openhtmltopdf is better suited for Java developers who need to generate PDFs from HTML within their applications. It offers more control over the conversion process and supports a broad subset of CSS aimed at print layout.

pdf2htmlEX, on the other hand, excels at converting existing PDFs to HTML, preserving the original layout and formatting. It's more suitable for scenarios where you need to make PDF content accessible on the web or in applications that can render HTML.

HTML to PDF converter for PHP

Pros of dompdf

  • PHP-based, making it easy to integrate with PHP applications
  • Generates PDFs from HTML and CSS, allowing for dynamic content creation
  • Supports a wide range of CSS properties and features

Cons of dompdf

  • Limited support for complex layouts and advanced PDF features
  • May struggle with large documents or heavy image content
  • Performance can be slower compared to native PDF manipulation libraries

Code Comparison

dompdf:

require_once 'dompdf/autoload.inc.php';
use Dompdf\Dompdf; // import the namespaced class so "new Dompdf()" resolves
$dompdf = new Dompdf();
$dompdf->loadHtml('<h1>Hello, World!</h1>');
$dompdf->render();
$dompdf->stream("document.pdf");

pdf2htmlEX:

pdf2htmlEX input.pdf output.html

pdf2htmlEX is primarily a command-line tool, while dompdf is a PHP library. pdf2htmlEX focuses on converting PDFs to HTML, whereas dompdf generates PDFs from HTML. The choice between them depends on the specific use case and requirements of the project.

iText for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with a better document engine, high and low-level programming capabilities and the ability to create, edit and enhance PDF documents, iText can be a boon to nearly every workflow.

Pros of itext-java

  • Comprehensive Java library for creating and manipulating PDFs
  • Extensive documentation and community support
  • Supports advanced PDF features like digital signatures and form filling

Cons of itext-java

  • Primarily focused on PDF creation and manipulation, not HTML conversion
  • Steeper learning curve due to its extensive feature set
  • Commercial licensing required for some use cases

Code Comparison

itext-java:

PdfDocument pdf = new PdfDocument(new PdfWriter("output.pdf"));
Document document = new Document(pdf);
document.add(new Paragraph("Hello, World!"));
document.close();

pdf2htmlEX:

pdf2htmlEX input.pdf output.html

Key Differences

  • pdf2htmlEX is specifically designed for converting PDF to HTML, while itext-java is a more general-purpose PDF library
  • itext-java offers programmatic PDF creation and manipulation, whereas pdf2htmlEX focuses on conversion
  • pdf2htmlEX is a command-line tool, while itext-java is a Java library integrated into applications

Use Cases

  • Choose itext-java for creating, editing, or manipulating PDFs within Java applications
  • Opt for pdf2htmlEX when you need to convert existing PDFs to HTML format, especially for web display

Both tools serve different purposes and can be complementary in a workflow that involves both PDF manipulation and conversion to HTML.

README

pdf2htmlEX

Differences from upstream pdf2htmlEX:

This is my branch of pdf2htmlEX which aims to allow an open collaboration to help keep the project active. A number of changes and improvements have been incorporated from other forks:

  • Lots of bug fixes, mostly of edge cases
  • Integration of latest Cairo code
  • Out of source building
  • Rewritten handling of obscured/partially obscured text - now much more accurate
  • Some support for transparent text
  • Improvement of DPI settings - clamping of DPI to ensure output graphic isn't too big
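To illustrate the idea behind the DPI clamping in the last item, here is a rough sketch. It is not the code used in pdf2htmlEX: the function name `clamp_dpi` and the `max_pixels` limit are assumptions chosen for the example. It relies only on the fact that one PDF point is 1/72 inch.

```python
def clamp_dpi(requested_dpi: float,
              page_width_pt: float,
              page_height_pt: float,
              max_pixels: int = 4096) -> float:
    """Lower the DPI if the longest page side would exceed max_pixels.

    max_pixels is a hypothetical limit picked for illustration.
    """
    # PDF page sizes are expressed in points; 72 points make one inch.
    longest_side_in = max(page_width_pt, page_height_pt) / 72.0
    # Highest DPI at which the longest side still fits inside max_pixels.
    limit = max_pixels / longest_side_in
    return min(requested_dpi, limit)
```

With a 720 pt square page and `max_pixels=3000`, a request for 600 DPI is reduced to 300, while a request for 144 DPI is left unchanged.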

--correct-text-visibility tracks the visibility of 4 sample points for each character (currently the 4 corners of the character's bounding box, inset slightly) to determine visibility. It now has two modes. 1 = Fully occluded text handled (i.e. doesn't get put into the HTML layer). 2 = Partially occluded text handled.

The default is now "1", so fully occluded text should no longer show through. If "2" is selected then if the character is partially occluded it will be drawn in the background layer. In this case, the rendered DPI of the page will be automatically increased to --covered-text-dpi (default: 300) to reduce the impact of rasterized text.
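The sampling rule described above can be sketched as follows. This is an illustration of the idea only, not the project's C++ implementation: the inset fraction, the `is_covered` callback, and the returned labels are assumptions made for the example.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def sample_points(bbox: BBox, inset: float = 0.1) -> List[Point]:
    """Four corners of the bounding box, pulled inward by a fraction of its size."""
    x0, y0, x1, y1 = bbox
    dx = (x1 - x0) * inset
    dy = (y1 - y0) * inset
    return [(x0 + dx, y0 + dy), (x1 - dx, y0 + dy),
            (x0 + dx, y1 - dy), (x1 - dx, y1 - dy)]


def place_character(bbox: BBox,
                    is_covered: Callable[[Point], bool],
                    mode: int) -> str:
    """Decide where a character goes for a given --correct-text-visibility mode.

    Returns "html", "omit_from_html", or "background" (labels are hypothetical).
    """
    covered = sum(1 for p in sample_points(bbox) if is_covered(p))
    if mode >= 1 and covered == 4:
        return "omit_from_html"   # fully occluded: kept out of the HTML layer
    if mode == 2 and covered > 0:
        return "background"       # partially occluded: drawn in the background layer
    return "html"
```

For example, a character whose left half is hidden has two of its four sample points covered, so it stays in the HTML layer in mode 1 and moves to the background layer in mode 2.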

For maximum accuracy I strongly recommend using the output options: --font-size-multiplier 1 --zoom 25. This will circumvent rounding errors inside web browsers. You will then have to scale down the resulting HTML page using an appropriate "scale" transform.
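As a sketch of that workflow: converting at `--zoom 25` means the page must be shrunk by a factor of 1/25 = 0.04 to return to its original size. The helper names below are hypothetical, and the `#page-container` selector is an assumption about the generated markup, so check the output HTML before relying on it.

```python
from typing import List


def high_precision_command(pdf_path: str, zoom: int = 25) -> List[str]:
    """Arguments for the recommended options: --font-size-multiplier 1 --zoom N."""
    return ["pdf2htmlEX", "--font-size-multiplier", "1",
            "--zoom", str(zoom), pdf_path]


def scale_down_css(zoom: float, selector: str = "#page-container") -> str:
    """CSS rule that undoes the zoom with a matching "scale" transform."""
    factor = 1.0 / zoom  # zoom 25 -> scale 0.04
    return (f"{selector} {{ transform: scale({factor:g}); "
            f"transform-origin: 0 0; }}")
```

Setting `transform-origin` to the top-left corner keeps the scaled page anchored where the unscaled page started instead of shrinking toward its center.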

If you are concerned about file size of the resulting HTML, then I recommend patching fontforge to prevent it writing the current time into the dumped fonts, and then post-process the pdf2htmlEX data to remove duplicate files - there will usually be many duplicate background images and fonts.
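The post-processing step could begin by finding which output files are byte-for-byte identical. The sketch below only reports duplicates; it assumes the conversion wrote its fonts and background images as separate files into one directory, and the function name `find_duplicates` is hypothetical. Removing files and rewriting the references to them in the HTML and CSS is left out.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def find_duplicates(directory: str) -> Dict[str, List[Path]]:
    """Group files in `directory` by SHA-256 digest and keep groups of two or more."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Only digests shared by several files indicate duplicates
    return {d: paths for d, paths in groups.items() if len(paths) > 1}
```

Fonts that embed the current time differ in their bytes even when they are otherwise the same, which is why the fontforge patch mentioned above is needed before a hash comparison like this can catch them.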

一图胜千言
A beautiful demo is worth a thousand words

  • Bible de Genève, 1564 (fonts and typography): HTML / PDF
  • Cheat Sheet (math formulas): HTML / PDF
  • Scientific Paper (text and figures): HTML / PDF
  • Full Circle Magazine (read while downloading): HTML / PDF
  • Git Manual (CJK support): HTML / PDF

pdf2htmlEX renders PDF files in HTML, utilizing modern Web technologies. Academic papers with lots of formulas and figures? Magazines with complicated layouts? No problem!

pdf2htmlEX is also an online publishing tool which is flexible for many different use cases.

Learn more about who should use pdf2htmlEX and why.

Features

  • Native HTML text with precise font and location.
  • Flexible output: all-in-one HTML or on demand page loading (needs JavaScript).
  • Moderate file size, sometimes even smaller than PDF.
  • Supporting links, outlines (bookmarks), printing, SVG background, Type 3 fonts and more...

LICENSE

pdf2htmlEX, as a whole package, is licensed under GPLv3+. Some resource files are released with relaxed licenses, read LICENSE for more details.

Acknowledgements

pdf2htmlEX is made possible thanks to the following projects:

Testing Powered By SauceLabs

pdf2htmlEX is inspired by the following projects:

  • pdftohtml from poppler
  • MuPDF
  • PDF.js
  • Crocodoc
  • Google Doc

Special Thanks

  • Hongliang Tian
  • Wanmin Liu