tika
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
Top Related Projects
- Parsr: Transforms PDF, Documents and Images into Enriched Structured Data
- Tabula: A tool for liberating data tables trapped inside PDF files
- Tesseract: Open Source OCR Engine (main repository)
- unoconv: Universal Office Converter, converts between any document format supported by LibreOffice/OpenOffice
Quick Overview
Apache Tika is a content detection and analysis framework that can extract metadata and text content from a wide variety of file formats, including common document formats, spreadsheets, presentations, images, audio, and video. It is designed to be a robust and flexible tool for working with unstructured data.
Pros
- Broad File Format Support: Tika can handle a wide range of file formats, making it a versatile tool for working with diverse data sources.
- Metadata Extraction: Tika can extract metadata from files, providing valuable information about the content and its origins.
- Text Extraction: Tika can extract the textual content from files, enabling text-based analysis and processing.
- Extensibility: Tika is designed to be extensible, allowing developers to add support for new file formats or customize its behavior to fit their specific needs (a custom-parser sketch follows this list).
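To make the extensibility point concrete, here is a minimal sketch of a custom parser for a made-up application/x-hello format. The class name and MIME type are hypothetical; a real parser would normally also be listed in META-INF/services/org.apache.tika.parser.Parser so that the auto-detecting parser can discover it.

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical parser for a made-up "application/x-hello" text format.
public class HelloParser implements Parser {

    private static final Set<MediaType> SUPPORTED_TYPES =
            Collections.singleton(MediaType.application("x-hello"));

    @Override
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        // Record a custom metadata field, then emit the content as XHTML SAX events,
        // which is how Tika parsers report extracted text.
        metadata.set("hello:note", "parsed by HelloParser");
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.element("p", new String(stream.readAllBytes(), StandardCharsets.UTF_8));
        xhtml.endDocument();
    }
}

Emitting XHTML SAX events through XHTMLContentHandler is the usual pattern for Tika parsers, so the output plugs into the same content handlers used by the built-in parsers.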
Cons
- Complexity: Tika is a feature-rich library, which can make it challenging for new users to get started and understand all of its capabilities.
- Performance: Depending on the file size and format, Tika's processing can be resource-intensive, which may impact performance in some use cases (a sketch for capping extraction size follows this list).
- Dependency Management: Tika relies on a large number of external libraries, which can make dependency management and version compatibility a potential issue.
- Limited GUI: Tika is primarily a command-line and programmatic tool, and it lacks a robust graphical user interface (GUI) for non-technical users.
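On the performance point, one simple mitigation when only part of a very large document is needed is to cap how much text the Tika facade collects. A minimal sketch (the file name is just an example):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class CappedExtraction {
    public static void main(String[] args) throws IOException, TikaException {
        Tika tika = new Tika();
        // Collect at most ~1 million characters; parseToString stops gathering text
        // once the limit is reached and returns what it has so far.
        tika.setMaxStringLength(1_000_000);
        try (InputStream in = new FileInputStream("very-large-report.pdf")) {
            String text = tika.parseToString(in);
            System.out.println("Extracted " + text.length() + " characters");
        }
    }
}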
Code Examples
Here are a few examples of how to use Apache Tika in Java:
// Extract text from a file
try (InputStream inputStream = new FileInputStream("example.pdf")) {
    Tika tika = new Tika();
    String text = tika.parseToString(inputStream);
    System.out.println(text);
} catch (IOException | TikaException e) {
    e.printStackTrace();
}
// Extract metadata from a file
Tika tika = new Tika();
Metadata metadata = new Metadata();
try (InputStream inputStream = new FileInputStream("example.jpg");
     Reader reader = tika.parse(inputStream, metadata)) {
    // Metadata is populated while the document is parsed, so consume the reader first.
    char[] buffer = new char[8192];
    while (reader.read(buffer) != -1) {
        // discard the text; this example only needs the metadata
    }
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
} catch (IOException e) {
    e.printStackTrace();
}
// Detect the file type
try (InputStream inputStream = new FileInputStream("example.docx")) {
    Tika tika = new Tika();
    String detectedType = tika.detect(inputStream);
    System.out.println("File type: " + detectedType);
} catch (IOException e) {
    e.printStackTrace();
}
// Parse a file with the lower-level Parser API (content handler, metadata, parse context)
try (InputStream inputStream = new FileInputStream("example.xlsx")) {
    Tika tika = new Tika();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    tika.getParser().parse(inputStream, handler, metadata, context);
    System.out.println("Text content: " + handler.toString());
    System.out.println("Metadata: " + metadata);
} catch (IOException | SAXException | TikaException e) {
    e.printStackTrace();
}
Getting Started
To get started with Apache Tika, you can follow these steps:
- Add the Tika dependency to your project. For example, in a Maven-based Java project, you can add the following dependency to your pom.xml file (see the note after these steps; for actual parsing you will also want the standard parsers package):
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>2.6.0</version>
</dependency>
- Import the necessary Tika classes in your Java code:
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
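Note that tika-core on its own contains the core APIs and type detection but no format parsers, so text extraction will generally come back empty. To actually parse documents, also add the standard parsers package (version matching the tika-core entry above):

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.6.0</version>
</dependency>

Putting the steps together, a minimal sketch (the file name is illustrative):

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaQuickStart {
    public static void main(String[] args) throws IOException, TikaException {
        Tika tika = new Tika();
        File file = new File("example.pdf");

        // Detect the MIME type from the file name and a peek at the content.
        System.out.println("Type: " + tika.detect(file));

        // Extract the plain text (needs tika-parsers-standard-package on the classpath).
        System.out.println(tika.parseToString(file));
    }
}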
Competitor Comparisons
Parsr: Transforms PDF, Documents and Images into Enriched Structured Data
Pros of Parsr
- Focuses on document reconstruction and layout analysis, providing more structured output
- Offers a web-based UI for visualizing and editing extracted content
- Supports multiple output formats, including JSON, markdown, and text
Cons of Parsr
- More specialized for document processing, while Tika is a general-purpose content extraction library
- Smaller community and less frequent updates compared to Tika
- Limited language support for text extraction compared to Tika's extensive language capabilities
Code Comparison
Parsr (JavaScript):
const Parsr = require('parsr');
const parser = new Parsr();
parser.run('document.pdf', { output_format: 'json' })
.then(result => console.log(result))
.catch(error => console.error(error));
Tika (Java):
import org.apache.tika.Tika;
Tika tika = new Tika();
String content = tika.parseToString(new File("document.pdf"));
System.out.println(content);
Both libraries provide methods for parsing documents, but Parsr offers more options for output formatting and document structure analysis, while Tika focuses on content extraction across a wide range of file formats.
Tabula is a tool for liberating data tables trapped inside PDF files
Pros of Tabula
- Specialized in extracting tables from PDFs, offering more precise table extraction
- User-friendly GUI for those who prefer visual interaction
- Supports command-line usage for automation and integration
Cons of Tabula
- Limited to table extraction from PDFs, while Tika handles various file formats
- Less actively maintained compared to Tika's frequent updates
- Smaller community and fewer contributors
Code Comparison
Tabula (Ruby):
require 'tabula'
pdf_path = "sample.pdf"
Tabula.extract_tables(pdf_path).each do |table|
puts table
end
Tika (Java):
import org.apache.tika.Tika;
Tika tika = new Tika();
String content = tika.parseToString(new File("sample.pdf"));
System.out.println(content);
Summary
Tabula excels in extracting tables from PDFs with a user-friendly interface, while Tika offers broader file format support and content extraction capabilities. Tika is more actively maintained and has a larger community. Tabula is ideal for specific table extraction tasks, whereas Tika is better suited for general-purpose content extraction across various file formats.
Tesseract Open Source OCR Engine (main repository)
Pros of Tesseract
- Specialized in Optical Character Recognition (OCR), offering more advanced and accurate text extraction from images
- Supports a wide range of languages and can be trained for new ones
- Provides low-level control over OCR processes, allowing for fine-tuning and customization
Cons of Tesseract
- Limited to OCR functionality, lacking the broad file format support and content extraction capabilities of Tika
- Requires more setup and configuration for optimal performance
- May have a steeper learning curve for users unfamiliar with OCR concepts
Code Comparison
Tesseract (Python binding):
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open('image.png'))
print(text)
Tika (Java):
import org.apache.tika.Tika;
Tika tika = new Tika();
String text = tika.parseToString(new File("document.pdf"));
System.out.println(text);
While Tesseract focuses on extracting text from images, Tika provides a more general-purpose content extraction API for various file formats. Tesseract offers more specialized OCR capabilities, while Tika excels in handling a wide range of document types and metadata extraction.
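The two can also be combined: when a local Tesseract install is available on the PATH, Tika's standard parser package shells out to it for image formats, so the same Tika call returns OCR text. A minimal sketch (the file name is illustrative; it assumes tika-parsers-standard-package on the classpath plus an installed tesseract binary):

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaOcrExample {
    public static void main(String[] args) throws IOException, TikaException {
        // Tika's TesseractOCRParser invokes the external tesseract command for images;
        // if the binary is not installed, OCR is simply skipped.
        Tika tika = new Tika();
        System.out.println(tika.parseToString(new File("scanned-page.png")));
    }
}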
unoconv: Universal Office Converter, converts between any document format supported by LibreOffice/OpenOffice.
Pros of unoconv
- Specialized in document conversion, particularly for LibreOffice/OpenOffice formats
- Lightweight and focused on a specific task
- Command-line interface for easy integration into scripts and workflows
Cons of unoconv
- Limited to document conversion, lacks broader content analysis capabilities
- Depends on LibreOffice/OpenOffice installation
- Less active development and smaller community compared to Tika
Code comparison
unoconv (command line):
unoconv -f pdf document.docx
Tika (Java):
InputStream input = new FileInputStream("document.docx");
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(input, handler, metadata, context);
Summary
unoconv is a specialized tool for document conversion, particularly useful for LibreOffice/OpenOffice formats. It offers a simple command-line interface and is lightweight. However, it has limited functionality compared to Tika, which provides broader content analysis and metadata extraction capabilities. Tika has a larger community and more active development, making it more suitable for complex content processing tasks. The choice between the two depends on the specific requirements of the project and the desired level of functionality.
README
Welcome to Apache Tika https://tika.apache.org/
Apache Tika(TM) is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Tika is a project of the Apache Software Foundation.
Apache Tika, Tika, Apache, the Apache feather logo, and the Apache Tika project logo are trademarks of The Apache Software Foundation.
Getting Started
Pre-built binaries of Apache Tika standalone applications are available from https://tika.apache.org/download.html . Pre-built binaries of all the Tika jars can be fetched from Maven Central or your favourite Maven mirror.
Tika 2.x and its support for Java 8 are planned to reach End of Life (EOL) in April 2025. See the Tika roadmap for 2.x, 3.x and beyond.
Tika is based on Java 17 and uses the Maven 3 build system. N.B. Docker is used for tests in tika-integration-tests. As of Tika 2.5.1, if Docker is not installed, those tests are skipped. Docker is required for a successful build on earlier 2.x versions.
To build Tika from source, use the following command in the main directory:
mvn clean install
The build consists of a number of components, including a standalone runnable jar that you can use to try out Tika features. You can run it like this:
java -jar tika-app/target/tika-app-*.jar --help
To build a specific project (for example, tika-server-standard):
mvn clean install -am -pl :tika-server-standard
If the ossindex-maven-plugin is causing the build to fail because a dependency has now been discovered to have a vulnerability:
mvn clean install -Dossindex.skip
Maven Dependencies
Apache Tika provides a Bill of Materials (BOM) artifact to align Tika module versions and simplify version management. To avoid convergence errors in your own project, import this BOM or Tika's parent pom.xml in your dependency management section.
If you use Apache Maven:
<project>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-bom</artifactId>
        <version>4.x.y</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers-standard-package</artifactId>
      <!-- version not required since BOM included -->
    </dependency>
  </dependencies>
</project>
For Gradle:
dependencies {
    implementation(platform("org.apache.tika:tika-bom:4.x.y"))
    // no explicit version needed; the BOM (a Gradle "platform") manages it
    implementation("org.apache.tika:tika-parsers-standard-package")
}
Migrating to 4.x
TBD
Contributing via Github
See the pull request template.
NOTE: Please open pull requests against the main branch. We locked the master branch in September 2020 and no longer use it.
Thanks to all the people who have contributed
Building from a Specific Tag
Let's assume that you want to build the 3.0.1 tag:
0. Download and install hub.github.com
1. git clone https://github.com/apache/tika.git
2. cd tika
3. git checkout 3.0.1
4. mvn clean install
If a new vulnerability has been discovered between the date of the tag and the date you are building the tag, you may need to build with:
4. mvn clean install -Dossindex.skip
If a local test is not working in your environment, please notify the project at dev@tika.apache.org. As an immediate workaround, you can turn off individual tests with e.g.:
4. mvn clean install -Dossindex.skip -Dtest=\!UnpackerResourceTest#testPDFImages
License (see also LICENSE.txt)
Collective work: Copyright 2011 The Apache Software Foundation.
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Apache Tika includes a number of subcomponents with separate copyright notices and license terms. Your use of these subcomponents is subject to the terms and conditions of the licenses listed in the LICENSE.txt file.
Export Control
This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See http://www.wassenaar.org/ for more information.
The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code.
The following provides more details on the included cryptographic software:
Apache Tika uses the Bouncy Castle generic encryption libraries for extracting text content and metadata from encrypted PDF files. See http://www.bouncycastle.org/ for more details on Bouncy Castle.
Mailing Lists
Discussion about Tika takes place on the following mailing lists:
- user@tika.apache.org - About using Tika
- dev@tika.apache.org - About developing Tika
Notifications on all code changes are sent to the following mailing list:
- commits@tika.apache.org
The mailing lists are open to anyone and publicly archived.
You can subscribe to the mailing lists by sending a message to [LIST]-subscribe@tika.apache.org (for example, user-subscribe@...).
To unsubscribe, send a message to [LIST]-unsubscribe@tika.apache.org.
For more instructions, send a message to [LIST]-help@tika.apache.org.
Issue Tracker
If you encounter errors in Tika or want to suggest an improvement or a new feature, please visit the Tika issue tracker. There you can also find the latest information on known issues and recent bug fixes and enhancements.
Build Issues
TODO
- Need to install JCE (Java Cryptography Extension).
- If you find any other issues while building, please email the dev@tika.apache.org list.