tika
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
Top Related Projects
- Parsr: Transforms PDF, Documents and Images into Enriched Structured Data
- Tabula: A tool for liberating data tables trapped inside PDF files
- Tesseract: Open Source OCR Engine (main repository)
- unoconv: Universal Office Converter, converts between any document format supported by LibreOffice/OpenOffice
Quick Overview
Apache Tika is a content detection and analysis framework that can extract metadata and text content from a wide variety of file formats, including common document formats, spreadsheets, presentations, images, audio, and video. It is designed to be a robust and flexible tool for working with unstructured data.
Pros
- Broad File Format Support: Tika can handle a wide range of file formats, making it a versatile tool for working with diverse data sources.
- Metadata Extraction: Tika can extract metadata from files, providing valuable information about the content and its origins.
- Text Extraction: Tika can extract the textual content from files, enabling text-based analysis and processing.
- Extensibility: Tika is designed to be extensible, allowing developers to add support for new file formats or customize its behavior to fit their specific needs (a custom-parser sketch follows this list).
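To make the extensibility point concrete, here is a minimal sketch of a custom parser for a made-up application/x-hello format. The class name and MIME type are hypothetical; a real parser would normally also be listed in META-INF/services/org.apache.tika.parser.Parser so that the auto-detecting parser can discover it.

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical parser for a made-up "application/x-hello" text format.
public class HelloParser implements Parser {

    private static final Set<MediaType> SUPPORTED_TYPES =
            Collections.singleton(MediaType.application("x-hello"));

    @Override
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        // Record a custom metadata field, then emit the content as XHTML SAX events,
        // which is how Tika parsers report extracted text.
        metadata.set("hello:note", "parsed by HelloParser");
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.element("p", new String(stream.readAllBytes(), StandardCharsets.UTF_8));
        xhtml.endDocument();
    }
}

Emitting XHTML SAX events through XHTMLContentHandler is the usual pattern for Tika parsers, so the output plugs into the same content handlers used by the built-in parsers.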
Cons
- Complexity: Tika is a feature-rich library, which can make it challenging for new users to get started and understand all of its capabilities.
- Performance: Depending on the file size and format, Tika's processing can be resource-intensive, which may impact performance in some use cases (a sketch for capping extraction size follows this list).
- Dependency Management: Tika relies on a large number of external libraries, which can make dependency management and version compatibility a potential issue.
- Limited GUI: Tika is primarily a command-line and programmatic tool, and it lacks a robust graphical user interface (GUI) for non-technical users.
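On the performance point, one simple mitigation when only part of a very large document is needed is to cap how much text the Tika facade collects. A minimal sketch (the file name is just an example):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class CappedExtraction {
    public static void main(String[] args) throws IOException, TikaException {
        Tika tika = new Tika();
        // Collect at most ~1 million characters; parseToString stops gathering text
        // once the limit is reached and returns what it has so far.
        tika.setMaxStringLength(1_000_000);
        try (InputStream in = new FileInputStream("very-large-report.pdf")) {
            String text = tika.parseToString(in);
            System.out.println("Extracted " + text.length() + " characters");
        }
    }
}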
Code Examples
Here are a few examples of how to use Apache Tika in Java:
// Extract text from a file
try (InputStream inputStream = new FileInputStream("example.pdf")) {
    Tika tika = new Tika();
    String text = tika.parseToString(inputStream);
    System.out.println(text);
} catch (IOException | TikaException e) {
    e.printStackTrace();
}
// Extract metadata from a file
Tika tika = new Tika();
Metadata metadata = new Metadata();
try (InputStream inputStream = new FileInputStream("example.jpg");
     Reader reader = tika.parse(inputStream, metadata)) {
    // Metadata is populated while the document is parsed, so consume the reader first.
    char[] buffer = new char[8192];
    while (reader.read(buffer) != -1) {
        // discard the text; this example only needs the metadata
    }
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
} catch (IOException e) {
    e.printStackTrace();
}
// Detect the file type
try (InputStream inputStream = new FileInputStream("example.docx")) {
    Tika tika = new Tika();
    String detectedType = tika.detect(inputStream);
    System.out.println("File type: " + detectedType);
} catch (IOException e) {
    e.printStackTrace();
}
// Parse a file with the lower-level Parser API (content handler, metadata, parse context)
try (InputStream inputStream = new FileInputStream("example.xlsx")) {
    Tika tika = new Tika();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    tika.getParser().parse(inputStream, handler, metadata, context);
    System.out.println("Text content: " + handler.toString());
    System.out.println("Metadata: " + metadata);
} catch (IOException | SAXException | TikaException e) {
    e.printStackTrace();
}
Getting Started
To get started with Apache Tika, you can follow these steps:
- Add the Tika dependency to your project. For example, in a Maven-based Java project, you can add the following dependency to your pom.xml file (see the note after these steps; for actual parsing you will also want the standard parsers package):
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>2.6.0</version>
</dependency>
- Import the necessary Tika classes in your Java code:
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
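Note that tika-core on its own contains the core APIs and type detection but no format parsers, so text extraction will generally come back empty. To actually parse documents, also add the standard parsers package (version matching the tika-core entry above):

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.6.0</version>
</dependency>

Putting the steps together, a minimal sketch (the file name is illustrative):

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaQuickStart {
    public static void main(String[] args) throws IOException, TikaException {
        Tika tika = new Tika();
        File file = new File("example.pdf");

        // Detect the MIME type from the file name and a peek at the content.
        System.out.println("Type: " + tika.detect(file));

        // Extract the plain text (needs tika-parsers-standard-package on the classpath).
        System.out.println(tika.parseToString(file));
    }
}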
Competitor Comparisons
Parsr: Transforms PDF, Documents and Images into Enriched Structured Data
Pros of Parsr
- Focuses on document reconstruction and layout analysis, providing more structured output
- Offers a web-based UI for visualizing and editing extracted content
- Supports multiple output formats, including JSON, markdown, and text
Cons of Parsr
- More specialized for document processing, while Tika is a general-purpose content extraction library
- Smaller community and less frequent updates compared to Tika
- Limited language support for text extraction compared to Tika's extensive language capabilities
Code Comparison
Parsr (JavaScript):
const Parsr = require('parsr');
const parser = new Parsr();
parser.run('document.pdf', { output_format: 'json' })
.then(result => console.log(result))
.catch(error => console.error(error));
Tika (Java):
import org.apache.tika.Tika;
Tika tika = new Tika();
String content = tika.parseToString(new File("document.pdf"));
System.out.println(content);
Both libraries provide methods for parsing documents, but Parsr offers more options for output formatting and document structure analysis, while Tika focuses on content extraction across a wide range of file formats.
Tabula is a tool for liberating data tables trapped inside PDF files
Pros of Tabula
- Specialized in extracting tables from PDFs, offering more precise table extraction
- User-friendly GUI for those who prefer visual interaction
- Supports command-line usage for automation and integration
Cons of Tabula
- Limited to table extraction from PDFs, while Tika handles various file formats
- Less actively maintained compared to Tika's frequent updates
- Smaller community and fewer contributors
Code Comparison
Tabula (Ruby):
require 'tabula'
pdf_path = "sample.pdf"
Tabula.extract_tables(pdf_path).each do |table|
puts table
end
Tika (Java):
import org.apache.tika.Tika;
Tika tika = new Tika();
String content = tika.parseToString(new File("sample.pdf"));
System.out.println(content);
Summary
Tabula excels in extracting tables from PDFs with a user-friendly interface, while Tika offers broader file format support and content extraction capabilities. Tika is more actively maintained and has a larger community. Tabula is ideal for specific table extraction tasks, whereas Tika is better suited for general-purpose content extraction across various file formats.
Tesseract Open Source OCR Engine (main repository)
Pros of Tesseract
- Specialized in Optical Character Recognition (OCR), offering more advanced and accurate text extraction from images
- Supports a wide range of languages and can be trained for new ones
- Provides low-level control over OCR processes, allowing for fine-tuning and customization
Cons of Tesseract
- Limited to OCR functionality, lacking the broad file format support and content extraction capabilities of Tika
- Requires more setup and configuration for optimal performance
- May have a steeper learning curve for users unfamiliar with OCR concepts
Code Comparison
Tesseract (Python binding):
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open('image.png'))
print(text)
Tika (Java):
import org.apache.tika.Tika;
Tika tika = new Tika();
String text = tika.parseToString(new File("document.pdf"));
System.out.println(text);
While Tesseract focuses on extracting text from images, Tika provides a more general-purpose content extraction API for various file formats. Tesseract offers more specialized OCR capabilities, while Tika excels in handling a wide range of document types and metadata extraction.
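The two can also be combined: when a local Tesseract install is available on the PATH, Tika's standard parser package shells out to it for image formats, so the same Tika call returns OCR text. A minimal sketch (the file name is illustrative; it assumes tika-parsers-standard-package on the classpath plus an installed tesseract binary):

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaOcrExample {
    public static void main(String[] args) throws IOException, TikaException {
        // Tika's TesseractOCRParser invokes the external tesseract command for images;
        // if the binary is not installed, OCR is simply skipped.
        Tika tika = new Tika();
        System.out.println(tika.parseToString(new File("scanned-page.png")));
    }
}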
unoconv: Universal Office Converter, converts between any document format supported by LibreOffice/OpenOffice.
Pros of unoconv
- Specialized in document conversion, particularly for LibreOffice/OpenOffice formats
- Lightweight and focused on a specific task
- Command-line interface for easy integration into scripts and workflows
Cons of unoconv
- Limited to document conversion, lacks broader content analysis capabilities
- Depends on LibreOffice/OpenOffice installation
- Less active development and smaller community compared to Tika
Code comparison
unoconv (command line):
unoconv -f pdf document.docx
Tika (Java):
InputStream input = new FileInputStream("document.docx");
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(input, handler, metadata, context);
Summary
unoconv is a specialized tool for document conversion, particularly useful for LibreOffice/OpenOffice formats. It offers a simple command-line interface and is lightweight. However, it has limited functionality compared to Tika, which provides broader content analysis and metadata extraction capabilities. Tika has a larger community and more active development, making it more suitable for complex content processing tasks. The choice between the two depends on the specific requirements of the project and the desired level of functionality.
README
Welcome to Apache Tika https://tika.apache.org/
Apache Tika(TM) is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Tika is a project of the Apache Software Foundation.
Apache Tika, Tika, Apache, the Apache feather logo, and the Apache Tika project logo are trademarks of The Apache Software Foundation.
Getting Started
Pre-built binaries of Apache Tika standalone applications are available from https://tika.apache.org/download.html . Pre-built binaries of all the Tika jars can be fetched from Maven Central or your favourite Maven mirror.
Tika 2.x and its support for Java 8 are planned to reach End of Life (EOL) in April 2025. See the Tika roadmap for 2.x, 3.x and beyond.
Tika is based on Java 17 and uses the Maven 3 build system. N.B. Docker is used for tests in tika-integration-tests. As of Tika 2.5.1, if Docker is not installed, those tests are skipped. Docker is required for a successful build on earlier 2.x versions.
To build Tika from source, use the following command in the main directory:
mvn clean install
The build consists of a number of components, including a standalone runnable jar that you can use to try out Tika features. You can run it like this:
java -jar tika-app/target/tika-app-*.jar --help
To build a specific project (for example, tika-server-standard):
mvn clean install -am -pl :tika-server-standard
If the ossindex-maven-plugin is causing the build to fail because a dependency has now been discovered to have a vulnerability:
mvn clean install -Dossindex.skip
Maven Dependencies
Apache Tika provides a Bill of Materials (BOM) artifact to align Tika module versions and simplify version management. To avoid convergence errors in your own project, import this BOM or Tika's parent pom.xml in your dependency management section.
If you use Apache Maven:
<project>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-bom</artifactId>
        <version>4.x.y</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers-standard-package</artifactId>
      <!-- version not required since BOM included -->
    </dependency>
  </dependencies>
</project>
For Gradle:
dependencies {
    implementation(platform("org.apache.tika:tika-bom:4.x.y"))
    // no explicit version needed; the BOM (a Gradle "platform") manages it
    implementation("org.apache.tika:tika-parsers-standard-package")
}
Migrating to 4.x
TBD
Contributing via Github
See the pull request template.
NOTE: Please open pull requests against the main branch. We locked the master branch in September 2020 and no longer use it.
Thanks to all the people who have contributed
Building from a Specific Tag
Let's assume that you want to build the 3.0.1 tag:
0. Download and install hub.github.com
1. git clone https://github.com/apache/tika.git
2. cd tika
3. git checkout 3.0.1
4. mvn clean install
If a new vulnerability has been discovered between the date of the tag and the date you are building the tag, you may need to build with:
4. mvn clean install -Dossindex.skip
If a local test is not working in your environment, please notify the project at dev@tika.apache.org. As an immediate workaround, you can turn off individual tests with e.g.:
4. mvn clean install -Dossindex.skip -Dtest=\!UnpackerResourceTest#testPDFImages
License (see also LICENSE.txt)
Collective work: Copyright 2011 The Apache Software Foundation.
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Apache Tika includes a number of subcomponents with separate copyright notices and license terms. Your use of these subcomponents is subject to the terms and conditions of the licenses listed in the LICENSE.txt file.
Export Control
This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See http://www.wassenaar.org/ for more information.
The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code.
The following provides more details on the included cryptographic software:
Apache Tika uses the Bouncy Castle generic encryption libraries for extracting text content and metadata from encrypted PDF files. See http://www.bouncycastle.org/ for more details on Bouncy Castle.
Mailing Lists
Discussion about Tika takes place on the following mailing lists:
- user@tika.apache.org - About using Tika
- dev@tika.apache.org - About developing Tika
Notifications on all code changes are sent to the following mailing list:
- commits@tika.apache.org
The mailing lists are open to anyone and publicly archived.
You can subscribe to the mailing lists by sending a message to [LIST]-subscribe@tika.apache.org (for example, user-subscribe@...).
To unsubscribe, send a message to [LIST]-unsubscribe@tika.apache.org.
For more instructions, send a message to [LIST]-help@tika.apache.org.
Issue Tracker
If you encounter errors in Tika or want to suggest an improvement or a new feature, please visit the Tika issue tracker. There you can also find the latest information on known issues and recent bug fixes and enhancements.
Build Issues
TODO
- Need to install JCE (Java Cryptography Extension).
- If you find any other issues while building, please email the dev@tika.apache.org list.