Top Related Projects
An HTML to PDF library for the JVM. Based on Flying Saucer and Apache PDF-BOX 2. With SVG image support. Now also with accessible PDF support (WCAG, Section 508, PDF/UA)!
iText for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with a better document engine, high and low-level programming capabilities and the ability to create, edit and enhance PDF documents, iText can be a boon to nearly every workflow.
PDF Reader in JavaScript
Quick Overview
Apache PDFBox is an open-source Java library for working with PDF documents. It allows for the creation, manipulation, and extraction of content from PDF files, as well as the ability to sign and validate PDF documents.
Pros
- Comprehensive PDF manipulation capabilities
- Active development and community support
- Well-documented API with extensive examples
- Free and open-source under the Apache License 2.0
Cons
- Performance can be slower compared to some commercial alternatives
- Limited support for advanced PDF features like 3D content
- Learning curve can be steep for complex operations
- Some font rendering issues with certain non-standard fonts
Code Examples
- Creating a simple PDF document:
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
PDPageContentStream contentStream = new PDPageContentStream(document, page);
contentStream.beginText();
contentStream.setFont(PDType1Font.HELVETICA, 12);
contentStream.newLineAtOffset(100, 700);
contentStream.showText("Hello, World!");
contentStream.endText();
contentStream.close();
document.save("hello_world.pdf");
document.close();
- Extracting text from a PDF:
PDDocument document = PDDocument.load(new File("input.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
System.out.println(text);
document.close();
- Adding an image to a PDF:
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
PDImageXObject image = PDImageXObject.createFromFile("image.jpg", document);
PDPageContentStream contentStream = new PDPageContentStream(document, page);
contentStream.drawImage(image, 100, 100);
contentStream.close();
document.save("document_with_image.pdf");
document.close();
Getting Started
To use Apache PDFBox in your Java project, add the following Maven dependency to your pom.xml file:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.27</version>
</dependency>
For Gradle, add this to your build.gradle file:
implementation 'org.apache.pdfbox:pdfbox:2.0.27'
After adding the dependency, you can start using PDFBox in your Java code by importing the necessary classes:
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
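Putting those imports together, a minimal complete class might look like the sketch below. It assumes the PDFBox 2.0.x API from the dependency shown earlier; the class name `HelloPdf` and its `write` helper are hypothetical names chosen for illustration. It uses try-with-resources so that the content stream and the document are closed even if an exception is thrown.

```java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

public class HelloPdf {

    /** Writes a one-page PDF containing a single line of text (hypothetical helper). */
    static void write(String path, String message) throws IOException {
        // try-with-resources closes the document even if writing fails
        try (PDDocument document = new PDDocument()) {
            PDPage page = new PDPage();
            document.addPage(page);

            // the content stream must be closed before the document is saved
            try (PDPageContentStream content = new PDPageContentStream(document, page)) {
                content.beginText();
                content.setFont(PDType1Font.HELVETICA, 12);
                content.newLineAtOffset(100, 700);
                content.showText(message);
                content.endText();
            }

            document.save(path);
        }
    }

    public static void main(String[] args) throws IOException {
        write("hello_world.pdf", "Hello, World!");
    }
}
```

Scoping the content stream in its own inner block is one way to make sure it is closed before `document.save` runs.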
Competitor Comparisons
An HTML to PDF library for the JVM. Based on Flying Saucer and Apache PDF-BOX 2. With SVG image support. Now also with accessible PDF support (WCAG, Section 508, PDF/UA)!
Pros of OpenHTMLToPDF
- Specializes in HTML to PDF conversion, offering more accurate rendering of web content
- Supports CSS3 and many HTML5 features, providing better compatibility with modern web standards
- Easier to use for developers familiar with HTML and CSS
Cons of OpenHTMLToPDF
- Limited functionality beyond HTML to PDF conversion compared to PDFBox's broader PDF manipulation capabilities
- Smaller community and less frequent updates than PDFBox
Code Comparison
OpenHTMLToPDF:
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.withUri("https://example.com");
builder.toStream(outputStream);
builder.run();
PDFBox:
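PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
PDPageContentStream contentStream = new PDPageContentStream(document, page);
// showText is only valid inside a text block and after a font has been set
contentStream.beginText();
contentStream.setFont(PDType1Font.HELVETICA, 12);
contentStream.newLineAtOffset(100, 700);
contentStream.showText("Hello, World!");
contentStream.endText();
contentStream.close();
document.save("output.pdf");
document.close();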
OpenHTMLToPDF is more focused on converting HTML to PDF, while PDFBox offers a wider range of PDF manipulation features. OpenHTMLToPDF is better suited for projects that primarily need to convert web content to PDF, while PDFBox is more versatile for general PDF operations.
iText for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with a better document engine, high and low-level programming capabilities and the ability to create, edit and enhance PDF documents, iText can be a boon to nearly every workflow.
Pros of iText
- Generally faster performance for PDF generation and manipulation
- More comprehensive feature set, especially for complex PDF operations
- Better support for digital signatures and encryption
Cons of iText
- Stricter licensing terms (AGPL or commercial license required)
- Steeper learning curve due to more complex API
- Less frequent updates and community contributions
Code Comparison
PDFBox:
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
PDPageContentStream contentStream = new PDPageContentStream(document, page);
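contentStream.beginText();
// ... continues with setFont, newLineAtOffset, showText and endText, then close and save, as in the first PDFBox example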
iText:
PdfDocument pdfDoc = new PdfDocument(new PdfWriter(dest));
Document document = new Document(pdfDoc);
Paragraph paragraph = new Paragraph("Hello World!");
document.add(paragraph);
document.close();
Both PDFBox and iText are popular Java libraries for working with PDF files. PDFBox is open-source and Apache-licensed, making it more suitable for a wide range of projects. It has a larger community and more frequent updates. iText, while offering more advanced features and better performance, comes with stricter licensing terms and a steeper learning curve. The choice between the two depends on specific project requirements, budget constraints, and the complexity of PDF operations needed.
PDF Reader in JavaScript
Pros of pdf.js
- Written in JavaScript, making it easily integrable into web applications
- Renders PDFs directly in the browser, providing a seamless user experience
- Lightweight and doesn't require server-side processing
Cons of pdf.js
- Limited PDF manipulation capabilities compared to PDFBox
- May have performance issues with large or complex PDF files
- Lacks some advanced features like digital signatures and form filling
Code Comparison
PDFBox (Java):
PDDocument document = PDDocument.load(new File("example.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
document.close();
pdf.js (JavaScript):
pdfjsLib.getDocument('example.pdf').promise.then(function(pdf) {
pdf.getPage(1).then(function(page) {
page.getTextContent().then(function(textContent) {
console.log(textContent.items.map(item => item.str).join(' '));
});
});
});
Both examples demonstrate basic text extraction from a PDF file. PDFBox uses a more straightforward approach with fewer nested callbacks, while pdf.js relies on promises and may require additional setup for browser use. PDFBox offers more comprehensive PDF manipulation capabilities, whereas pdf.js excels in browser-based rendering and basic interactions.
README
Apache PDFBox
The Apache PDFBox library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. PDFBox also includes several command line utilities. PDFBox is published under the Apache License, Version 2.0.
PDFBox is a project of the Apache Software Foundation.
Binary Downloads
You can download binary versions for releases currently under development or older releases from our Download Page.
Build
You need Java 11 (or higher) and Maven 3 to build PDFBox. The recommended build command is:
mvn clean install
The default build will compile the Java sources and package the binary classes into jar packages. See the Maven documentation for all the other available build options.
Contribute
There are various ways to help us improve PDFBox.
- look at the Issue Tracker to help us fix bugs.
- answer questions on our Users Mailing List.
- help us enhance the Examples.
- help us to enhance the PDFBox Documentation, which is also maintained on GitHub.
Support
Please follow the guidelines at our Support Page.
If you have questions about how to use PDFBox, do ask on the Users Mailing List. This will get you help from the entire community.
The PDFBox examples and the test code in the sources will also provide additional information.
And there are additional resources available on sites such as Stack Overflow.
If you are sure you have found a bug, then please report the issue in our Issue Tracker.
Known Limitations and Problems
See the Issue Tracker for the full list of known issues and requested features. Some of the more common issues are:
- You get text like "G38G43G36G51G5" instead of what you expect when you are extracting text. This is because the characters are a meaningless internal encoding that points to glyphs that are embedded in the PDF document. The only way to access the text is to use OCR. This may be a future enhancement.
- You get an error message like "java.io.IOException: Can't handle font width". This MIGHT be due to the fact that you don't have the org/apache/pdfbox/resources directory in your classpath. The easiest solution is to include the apache-pdfbox-x.x.x.jar in your classpath.
- You get text that has the correct characters, but in the wrong order. This might be because you have not enabled sorting. The text in PDF files is stored in chunks and the chunks do not need to be stored in the order that they are displayed on a page. By default, PDFBox does not sort the text.
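To see why unsorted extraction can scramble the reading order, here is a small library-free sketch. `ChunkOrderSketch`, `Chunk`, `inStoredOrder` and `inReadingOrder` are hypothetical names used only for illustration and are not part of PDFBox; the sketch also simplifies heavily by assuming every chunk carries a single (x, y) position. In PDFBox 2.0.x itself, the corresponding switch is `PDFTextStripper.setSortByPosition(true)`, called before `getText`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ChunkOrderSketch {

    /** Hypothetical text chunk: a string plus the position where it is drawn. */
    static final class Chunk {
        final float x;
        final float y;
        final String text;

        Chunk(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Joins chunks in the order they are stored, which is what unsorted extraction resembles. */
    static String inStoredOrder(List<Chunk> chunks) {
        return chunks.stream().map(c -> c.text).collect(Collectors.joining(" "));
    }

    /** Joins chunks top to bottom, then left to right (PDF y coordinates grow upwards). */
    static String inReadingOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        return sorted.stream().map(c -> c.text).collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Chunks stored in an order that differs from how they appear on the page
        List<Chunk> stored = List.of(
                new Chunk(100, 680, "second line"),
                new Chunk(160, 700, "World!"),
                new Chunk(100, 700, "Hello,"));
        System.out.println(inStoredOrder(stored));  // second line World! Hello,
        System.out.println(inReadingOrder(stored)); // Hello, World! second line
    }
}
```

The characters are identical in both outputs; only the order differs, which matches the symptom described in the last item above.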
License (see also LICENSE.txt)
Collective work: Copyright 2015 The Apache Software Foundation.
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Export control
This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See https://www.wassenaar.org/ for more information.
The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code.
The following provides more details on the included cryptographic software:
Apache PDFBox uses the Java Cryptography Architecture (JCA) and the Bouncy Castle libraries for handling encryption in PDF documents.