Top Related Projects
Convert PDF to HTML without losing text or format.
PDF Reader in JavaScript
macOS Quick Look extension for Markdown files.
An HTML to PDF library for the JVM. Based on Flying Saucer and Apache PDF-BOX 2. With SVG image support. Now also with accessible PDF support (WCAG, Section 508, PDF/UA)!
HTML to PDF converter for PHP
iText for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with a better document engine, high and low-level programming capabilities and the ability to create, edit and enhance PDF documents, iText can be a boon to nearly every workflow.
Quick Overview
pdf2htmlEX is an open-source tool that converts PDF files to HTML, preserving the original layout and formatting as much as possible. It aims to produce high-quality HTML output that closely resembles the original PDF, making it useful for web publishing and archiving purposes.
Pros
- Maintains the original layout and formatting of PDF documents
- Supports complex PDF features like fonts, images, and vector graphics
- Produces HTML that opens in desktop and mobile browsers (the layout stays fixed to the PDF page rather than reflowing)
- Offers various customization options for fine-tuning the conversion process
Cons
- May struggle with highly complex or non-standard PDF files
- Large PDF files can result in sizeable HTML output, potentially impacting page load times
- Requires some technical knowledge to set up and use effectively
- Development and updates have been somewhat inconsistent in recent years
Getting Started
To use pdf2htmlEX, follow these steps:
- Install pdf2htmlEX on your system (instructions vary by OS)
- Open a terminal or command prompt
- Navigate to the directory containing your PDF file
- Run the following command:
pdf2htmlEX input.pdf
This will generate an HTML file named input.html in the same directory. For more advanced usage and options, consult the pdf2htmlEX documentation.
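The steps above can also be scripted. The following is a minimal sketch, assuming pdf2htmlEX is installed and on the PATH; the helper names (`build_command`, `default_output_name`, `convert`) are hypothetical and exist only for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, output_name=None):
    """Return the argument list for a basic pdf2htmlEX conversion."""
    cmd = ["pdf2htmlEX", str(pdf_path)]
    if output_name is not None:
        # Optional second positional argument: the output HTML file name.
        cmd.append(output_name)
    return cmd


def default_output_name(pdf_path):
    """Output name used when none is given (input.pdf -> input.html)."""
    return Path(pdf_path).with_suffix(".html").name


def convert(pdf_path, output_name=None):
    """Run the conversion; raises if the binary is missing or the run fails."""
    if shutil.which("pdf2htmlEX") is None:
        raise FileNotFoundError("pdf2htmlEX not found on PATH")
    subprocess.run(build_command(pdf_path, output_name), check=True)
    return output_name or default_output_name(pdf_path)
```

Calling `convert("input.pdf")` from the directory that holds the PDF mirrors the manual command shown above.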
Competitor Comparisons
Convert PDF to HTML without losing text or format.
Pros of pdf2htmlEX
- Original project by the creator, potentially more authentic implementation
- May have more historical context and documentation
- Could be more stable due to longer development history
Cons of pdf2htmlEX
- Less active development and maintenance
- Fewer recent updates and bug fixes
- Potentially outdated dependencies and compatibility issues
Code Comparison
While both repositories contain similar core functionality, there might be slight differences in implementation or recent updates. Here's a hypothetical example of how a function might differ:
pdf2htmlEX:
void process_pdf(const char* input_file) {
// Older implementation
// ...
}
coolwanglu/pdf2htmlEX:
void process_pdf(const char* input_file, const char* output_file) {
// Updated implementation with additional parameter
// ...
}
Note that this is a simplified example and may not reflect actual differences between the repositories. The main distinctions are likely to be in recent updates, bug fixes, and potentially some feature additions or optimizations in the more actively maintained fork.
PDF Reader in JavaScript
Pros of pdf.js
- Pure JavaScript implementation, runs directly in web browsers
- Extensive browser compatibility and integration with web technologies
- Active development and maintenance by Mozilla
Cons of pdf.js
- Slower rendering performance for complex PDFs
- Limited support for advanced PDF features and interactivity
Code Comparison
pdf.js:
pdfjsLib.getDocument('document.pdf').promise.then(function(pdf) {
  pdf.getPage(1).then(function(page) {
    var scale = 1.5;
    var viewport = page.getViewport({ scale: scale });
    // Render page content
  });
});
pdf2htmlEX:
pdf2htmlEX input.pdf output.html
Key Differences
- pdf2htmlEX converts PDFs to HTML/CSS, while pdf.js renders PDFs directly in the browser
- pdf2htmlEX offers better text selection and search capabilities in the output
- pdf.js provides a more interactive viewing experience with zooming and page navigation
Use Cases
- pdf.js: Web-based PDF viewers, browser extensions
- pdf2htmlEX: Static HTML conversion, SEO-friendly PDF content
Community and Support
- pdf.js: Large community, frequent updates, extensive documentation
- pdf2htmlEX: Smaller community, less frequent updates, but specialized use cases
macOS Quick Look extension for Markdown files.
Pros of QLMarkdown
- Specifically designed for macOS Quick Look, providing seamless Markdown preview integration
- Supports a wide range of Markdown flavors and extensions
- Actively maintained with regular updates and improvements
Cons of QLMarkdown
- Limited to macOS platform, not cross-platform like pdf2htmlEX
- Focused solely on Markdown, while pdf2htmlEX handles PDF conversion to HTML
Code Comparison
QLMarkdown (Objective-C):
- (NSData *)dataOfType:(NSString *)typeName error:(NSError **)outError {
    [NSException raise:@"UnimplementedMethod" format:@"%@ is unimplemented", NSStringFromSelector(_cmd)];
    return nil;
}
pdf2htmlEX (C++):
void HTMLRenderer::process(PDFDoc *doc) {
    xref = doc->getXRef();
    catalog = doc->getCatalog();
    if(!catalog) throw "Cannot read the catalog";
    // ...
}
Note: The code snippets are not directly comparable as they serve different purposes. QLMarkdown focuses on macOS Quick Look integration, while pdf2htmlEX handles PDF to HTML conversion.
An HTML to PDF library for the JVM. Based on Flying Saucer and Apache PDF-BOX 2. With SVG image support. Now also with accessible PDF support (WCAG, Section 508, PDF/UA)!
Pros of openhtmltopdf
- Java-based, making it easier to integrate with Java applications
- Supports CSS 2.1 plus some CSS3 features aimed at print layout (flexbox and grid layouts are not supported)
- Actively maintained with regular updates and improvements
Cons of openhtmltopdf
- Limited support for complex PDF structures and interactive elements
- May have slower performance compared to native C++ implementations
- Requires Java runtime environment, which can be a drawback in some scenarios
Code Comparison
openhtmltopdf:
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.withUri("input.html");
builder.toStream(outputStream);
builder.run();
pdf2htmlEX:
pdf2htmlEX input.pdf output.html
The code comparison shows that openhtmltopdf is used programmatically within Java applications, while pdf2htmlEX is typically used as a command-line tool. This reflects their different approaches and use cases.
openhtmltopdf is better suited for Java developers who need to generate PDFs from HTML within their applications. It offers programmatic control over the conversion process through its builder API.
pdf2htmlEX, on the other hand, excels at converting existing PDFs to HTML, preserving the original layout and formatting. It's more suitable for scenarios where you need to make PDF content accessible on the web or in applications that can render HTML.
HTML to PDF converter for PHP
Pros of dompdf
- PHP-based, making it easy to integrate with PHP applications
- Generates PDFs from HTML and CSS, allowing for dynamic content creation
- Supports a wide range of CSS properties and features
Cons of dompdf
- Limited support for complex layouts and advanced PDF features
- May struggle with large documents or heavy image content
- Performance can be slower compared to native PDF manipulation libraries
Code Comparison
dompdf:
require_once 'dompdf/autoload.inc.php';
use Dompdf\Dompdf; // import the namespaced class
$dompdf = new Dompdf();
$dompdf->loadHtml('<h1>Hello, World!</h1>');
$dompdf->render();
$dompdf->stream("document.pdf");
pdf2htmlEX:
pdf2htmlEX input.pdf output.html
pdf2htmlEX is primarily a command-line tool, while dompdf is a PHP library. pdf2htmlEX focuses on converting PDFs to HTML, whereas dompdf generates PDFs from HTML. The choice between them depends on the specific use case and requirements of the project.
iText for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with a better document engine, high and low-level programming capabilities and the ability to create, edit and enhance PDF documents, iText can be a boon to nearly every workflow.
Pros of itext-java
- Comprehensive Java library for creating and manipulating PDFs
- Extensive documentation and community support
- Supports advanced PDF features like digital signatures and form filling
Cons of itext-java
- Primarily focused on PDF creation and manipulation, not HTML conversion
- Steeper learning curve due to its extensive feature set
- Dual-licensed under AGPL and a commercial license; use in closed-source products requires the commercial license
Code Comparison
itext-java:
PdfDocument pdf = new PdfDocument(new PdfWriter("output.pdf"));
Document document = new Document(pdf);
document.add(new Paragraph("Hello, World!"));
document.close();
pdf2htmlEX:
pdf2htmlEX input.pdf output.html
Key Differences
- pdf2htmlEX is specifically designed for converting PDF to HTML, while itext-java is a more general-purpose PDF library
- itext-java offers programmatic PDF creation and manipulation, whereas pdf2htmlEX focuses on conversion
- pdf2htmlEX is a command-line tool, while itext-java is a Java library integrated into applications
Use Cases
- Choose itext-java for creating, editing, or manipulating PDFs within Java applications
- Opt for pdf2htmlEX when you need to convert existing PDFs to HTML format, especially for web display
Both tools serve different purposes and can be complementary in a workflow that involves both PDF manipulation and conversion to HTML.
README
pdf2htmlEX
Differences from upstream pdf2htmlEX:
This is my branch of pdf2htmlEX which aims to allow an open collaboration to help keep the project active. A number of changes and improvements have been incorporated from other forks:
- Many bug fixes, mostly for edge cases
- Integration of latest Cairo code
- Out of source building
- Rewritten handling of obscured/partially obscured text - now much more accurate
- Some support for transparent text
- Improvement of DPI settings - clamping of DPI to ensure output graphic isn't too big
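The DPI clamping mentioned in the last item can be pictured roughly as follows. This is only a sketch of the general idea, not the project's actual code: the function name, the pixel limit `max_pixels_per_side`, and its default of 4000 are all assumptions made for illustration.

```python
POINTS_PER_INCH = 72.0  # PDF page sizes are expressed in points


def clamp_dpi(requested_dpi, page_width_pt, page_height_pt, max_pixels_per_side=4000):
    """Lower the DPI if the rendered page bitmap would exceed the pixel limit.

    max_pixels_per_side is a hypothetical limit, not a pdf2htmlEX setting.
    """
    longest_side_in = max(page_width_pt, page_height_pt) / POINTS_PER_INCH
    max_dpi = max_pixels_per_side / longest_side_in
    return min(requested_dpi, max_dpi)
```

For a US Letter page (612 x 792 points) a request of 144 DPI passes through unchanged, while a request of 600 DPI is reduced so that the longer side stays within the limit.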
--correct-text-visibility tracks the visibility of 4 sample points for each character (currently the 4 corners of the character's bounding box, inset slightly) to determine visibility.
It now has two modes. 1 = Fully occluded text handled (i.e. doesn't get put into the HTML layer). 2 = Partially occluded text handled.
The default is now "1", so fully occluded text should no longer show through. If "2" is selected, a partially occluded character is drawn in the background layer instead. In this case, the rendered DPI of the page is automatically increased to --covered-text-dpi (default: 300) to reduce the impact of rasterized text.
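The decision described above can be sketched as a small function. This illustrates the rules as stated in the text and is not the renderer's actual code; the function names and return labels are hypothetical. Two points are assumptions: that partially occluded text stays in the HTML layer in mode 1, and that "increased to" means the DPI is never lowered.

```python
def classify_character(samples_visible):
    """Classify a character from the visibility of its 4 sample points."""
    if len(samples_visible) != 4:
        raise ValueError("expected exactly 4 sample points")
    visible = sum(1 for v in samples_visible if v)
    if visible == 4:
        return "visible"
    if visible == 0:
        return "fully_occluded"
    return "partially_occluded"


def layer_for(samples_visible, mode):
    """Decide where a character ends up for mode 1 or mode 2."""
    if mode not in (1, 2):
        raise ValueError("this sketch only covers modes 1 and 2")
    state = classify_character(samples_visible)
    if state == "visible":
        return "html"
    if state == "fully_occluded":
        return "excluded_from_html"  # both modes keep it out of the HTML layer
    # Partially occluded: mode 2 rasterizes it into the background layer.
    # Assumption: mode 1 leaves it in the HTML layer.
    return "background" if mode == 2 else "html"


def page_dpi(base_dpi, has_partially_occluded_text, mode, covered_text_dpi=300):
    """Raise the page DPI when mode 2 rasterizes partially occluded text."""
    if mode == 2 and has_partially_occluded_text:
        return max(base_dpi, covered_text_dpi)  # assumption: never lowers DPI
    return base_dpi
```

A character with one hidden corner stays in the HTML layer under mode 1 and moves to the background layer under mode 2, which is also when the higher page DPI applies.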
For maximum accuracy I strongly recommend using the output options: --font-size-multiplier 1 --zoom 25. This will circumvent rounding errors inside web browsers. You will then have to scale down the resulting HTML page using an appropriate "scale" transform.
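As a rough illustration of the scale-down step, the sketch below builds the recommended options and the matching CSS declaration. The helper names are hypothetical, and the element the CSS is applied to (for example a wrapper around the page container) is left to the reader since the text does not specify it.

```python
def accuracy_options(zoom=25):
    """Command-line options recommended above for maximum accuracy."""
    return ["--font-size-multiplier", "1", "--zoom", str(zoom)]


def scale_css(zoom=25):
    """CSS declaration that undoes the zoom factor (the inverse scale)."""
    if zoom <= 0:
        raise ValueError("zoom must be positive")
    factor = 1 / zoom
    # transform-origin keeps the scaled page anchored at the top-left corner.
    return f"transform: scale({factor:g}); transform-origin: 0 0;"
```

With a zoom of 25 the inverse factor is 1/25, so the page is displayed at its original size while the browser positions text on the finer 25x grid.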
If you are concerned about file size of the resulting HTML, then I recommend patching fontforge to prevent it writing the current time into the dumped fonts, and then post-process the pdf2htmlEX data to remove duplicate files - there will usually be many duplicate background images and fonts.
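The duplicate-removal step could start from something like the sketch below, which groups files with identical contents by hash. It only finds the duplicates; deleting them and rewriting the references in the generated HTML is a separate step not shown here. The function name is hypothetical, and fonts will only hash identically once the timestamp issue described above is dealt with.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(directory):
    """Return groups of files under `directory` whose contents are identical."""
    by_hash = defaultdict(list)
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    # Keep only hashes shared by more than one file.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Each returned group lists paths that can be collapsed into a single file, typically repeated background images and fonts.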
一图胜千言
A beautiful demo is worth a thousand words
- Bible de Genève, 1564 (fonts and typography): HTML / PDF
- Cheat Sheet (math formulas): HTML / PDF
- Scientific Paper (text and figures): HTML / PDF
- Full Circle Magazine (read while downloading): HTML / PDF
- Git Manual (CJK support): HTML / PDF
pdf2htmlEX renders PDF files in HTML, utilizing modern Web technologies. Academic papers with lots of formulas and figures? Magazines with complicated layouts? No problem!
pdf2htmlEX is also an online publishing tool which is flexible for many different use cases.
Learn more about who should use pdf2htmlEX and why.
Features
- Native HTML text with precise font and location.
- Flexible output: all-in-one HTML or on demand page loading (needs JavaScript).
- Moderate file size, sometimes even smaller than PDF.
- Supporting links, outlines (bookmarks), printing, SVG background, Type 3 fonts and more...
Portals
- :house:Wiki Home
- Download & Building
- Quick Start
- Report Issues / Ask for Help
- :question:FAQ
- :envelope:Mailing List
- :mahjong:Chinese Mailing List (中文邮件列表)
LICENSE
pdf2htmlEX, as a whole package, is licensed under GPLv3+.
Some resource files are released with relaxed licenses, read LICENSE for more details.
Acknowledgements
pdf2htmlEX is made possible thanks to the following projects:
pdf2htmlEX is inspired by the following projects:
- pdftohtml from poppler
- MuPDF
- PDF.js
- Crocodoc
- Google Doc
Special Thanks
- Hongliang Tian
- Wanmin Liu