heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.


Top Related Projects

pywb

1,481

Core Python Web Archiving Toolkit for replay and recording of web archives

Quick Overview

Heritrix3 is an open-source, extensible web crawler developed by the Internet Archive. It is designed to crawl and archive web content for preservation and research, and it lets users build customized crawling workflows.

Pros

Extensible Architecture: Heritrix3 has a modular design, allowing users to easily extend its functionality by developing custom modules and plugins.
Scalable and Efficient: The crawler is capable of handling large-scale web crawling tasks, with features like distributed crawling and resource management.
Customizable Crawling Workflows: Users can configure various aspects of the crawling process, such as scope, politeness, and content selection, to suit their specific needs.
Robust Error Handling: Heritrix3 provides comprehensive error handling and reporting, making it easier to identify and address issues during the crawling process.
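
To make the idea of a configurable crawl scope concrete, here is a minimal sketch of a rule chain in which each rule answers ACCEPT, REJECT, or NONE and the last rule with an opinion wins. It is modeled loosely on how Heritrix3 evaluates a sequence of decide rules, but the class, method, and rule names below are hypothetical and this is not the crawler's actual code.

```java
import java.util.List;
import java.util.function.Function;

/** Illustrative sketch of a scope rule chain; all names here are hypothetical. */
class ScopeSketch {

    enum Decision { ACCEPT, REJECT, NONE }

    /** Applies the rules in order; the last rule that returns something other than NONE wins. */
    static Decision decide(List<Function<String, Decision>> rules, String uri) {
        Decision result = Decision.NONE;
        for (Function<String, Decision> rule : rules) {
            Decision d = rule.apply(uri);
            if (d != Decision.NONE) {
                result = d;
            }
        }
        return result;
    }

    /** Example chain: reject everything, accept one site, then reject ZIP files again. */
    static List<Function<String, Decision>> exampleRules() {
        return List.of(
            uri -> Decision.REJECT,
            uri -> uri.startsWith("https://example.org/") ? Decision.ACCEPT : Decision.NONE,
            uri -> uri.endsWith(".zip") ? Decision.REJECT : Decision.NONE
        );
    }

    public static void main(String[] args) {
        List<Function<String, Decision>> rules = exampleRules();
        System.out.println(decide(rules, "https://example.org/index.html")); // ACCEPT
        System.out.println(decide(rules, "https://example.org/data.zip"));   // REJECT
        System.out.println(decide(rules, "https://other.example/"));         // REJECT
    }
}
```

Ordering matters in this design: broad rules go first and narrower exceptions follow, so a later rule can override an earlier one without either needing to know about the other.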

Cons

Steep Learning Curve: Configuring and using Heritrix3 can be complex, especially for users new to web crawling and archiving.
Limited Documentation: The project's documentation, while available, could be more comprehensive and user-friendly.
Outdated User Interface: The web-based user interface of Heritrix3 is somewhat dated and may not meet the expectations of modern web applications.
Dependency on Java: Heritrix3 is written in Java, which may be a limitation for users who prefer other programming languages.

Getting Started

To get started with Heritrix3, follow these steps:

Download the latest version of Heritrix3 from the GitHub repository.
Extract the downloaded archive and navigate to the heritrix3 directory.
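Run the heritrix script in the bin directory (heritrix.cmd on Windows), passing operator credentials with the -a option, for example bin/heritrix -a admin:admin, to start the Heritrix3 engine and its web interface.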
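Open a web browser and navigate to https://localhost:8443/ to access the Heritrix3 web interface. The engine serves HTTPS with a self-signed certificate by default, so expect a browser certificate warning.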
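Create a new job by entering a job name in the create-job form on the main console page.
Configure the job by editing its crawler-beans.cxml file: at minimum set the operator contact URL and the seeds, then adjust crawl scope, politeness, and content selection according to your requirements.
Start the crawl from the job page by clicking "build", then "launch", and finally "unpause", since a newly launched job starts in a paused state.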

Heritrix3 provides extensive configuration options and features, so it's recommended to review the project's documentation to learn more about advanced usage and customization.
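
Job actions can also be scripted through the engine's REST API instead of the web interface. The following is a minimal sketch that only builds (and does not send) a POST request for a job action. The base URL assumes a local engine on the default port, and the job name "myjob" and the class and method names are hypothetical. Actually sending the request additionally requires HTTP Digest authentication and trusting the engine's self-signed certificate, neither of which is shown here.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Sketch: builds, but does not send, a REST request for a job action. */
class JobActionRequest {

    /** Assumed base URL of a locally running engine. */
    static final String ENGINE_URL = "https://localhost:8443/engine";

    /**
     * Builds a form-encoded POST such as "action=launch" against /engine/job/{jobName}.
     * The job name is a placeholder; typical actions are build, launch, and unpause.
     */
    static HttpRequest build(String jobName, String action) {
        URI uri = URI.create(ENGINE_URL + "/job/" + jobName);
        return HttpRequest.newBuilder(uri)
                .header("Accept", "application/xml")
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString("action=" + action, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("myjob", "launch");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Keeping request construction separate from sending makes it easy to check the target URL and action before pointing the code at a live engine.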

Competitor Comparisons

pywb

1,481

Core Python Web Archiving Toolkit for replay and recording of web archives

Pros of pywb

pywb is a more lightweight and flexible web archiving solution compared to Heritrix3, making it easier to integrate into custom applications.
pywb works with the standard web archive formats (WARC and ARC files, plus CDX indexes), which makes it flexible for replaying and managing existing archived data.
pywb has a more modern and user-friendly web interface, making it easier for users to interact with and manage their web archives.

Cons of pywb

Heritrix3 is a more mature and feature-rich web crawler, with a larger user community and more extensive documentation.
Heritrix3 is better suited for large-scale web crawling and archiving projects, with more advanced scheduling and configuration options.
Heritrix3 has a more robust and reliable crawling engine, which may be important for mission-critical web archiving applications.

Code Comparison

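Heritrix3 (Java, simplified illustration):

import java.util.logging.Level;
import java.util.logging.Logger;

public abstract class CrawlController {
    private static final Logger logger = Logger.getLogger(CrawlController.class.getName());

    protected abstract void initializeResources() throws Exception;
    protected abstract void runCrawl() throws Exception;
    protected abstract void cleanupResources();

    public void run() {
        try {
            initializeResources();
            runCrawl();
        } catch (Exception e) {
            // Log the failure instead of propagating it to the caller.
            logger.log(Level.SEVERE, "Crawl failed", e);
        } finally {
            // Always release resources, whether or not the crawl succeeded.
            cleanupResources();
        }
    }
}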

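pywb (Python, simplified illustration):

import logging

class ReplayApp:
    def __init__(self, application):
        self.application = application  # the wrapped WSGI application
        self.logger = logging.getLogger(__name__)

    def replay_request(self, env, start_response):
        try:
            return self.application(env, start_response)
        except Exception as e:  # turn any replay failure into a plain 500 response
            self.logger.error('Error replaying request: %s', e)
            start_response('500 Internal Server Error', [('Content-Type', 'text/plain')])
            return [b'Internal Server Error']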


README

Heritrix

Introduction

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.

Crawl Operators!

Heritrix is designed to respect the robots.txt exclusion directives† and META nofollow tags. Please consider the load your crawl will place on seed sites and set politeness policies accordingly. Also, always identify your crawl with contact information in the User-Agent so sites that may be adversely affected by your crawl can contact you or adapt their server behavior accordingly.

† The newer wildcard extension to robots.txt is not yet supported.
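
As a rough illustration of what a politeness policy computes, the sketch below waits a multiple of the previous fetch duration before contacting the same host again, clamped between a minimum and a maximum. The parameter names echo the delayFactor, minDelayMs, and maxDelayMs settings found in Heritrix3 job configurations, but the class, the formula as written, and the example values are assumptions for illustration, not the crawler's actual implementation.

```java
/** Sketch of a politeness delay calculation; not the crawler's actual code. */
class PolitenessDelay {

    /**
     * Returns how long to wait before the next request to the same host:
     * a multiple of the last fetch duration, clamped to [minDelayMs, maxDelayMs].
     */
    static long delayMs(long lastFetchDurationMs, double delayFactor,
                        long minDelayMs, long maxDelayMs) {
        long scaled = Math.round(lastFetchDurationMs * delayFactor);
        return Math.max(minDelayMs, Math.min(maxDelayMs, scaled));
    }

    public static void main(String[] args) {
        // Assumed settings: factor 5.0, at least 3 s and at most 30 s between requests.
        System.out.println(delayMs(200, 5.0, 3_000, 30_000));    // 3000
        System.out.println(delayMs(2_000, 5.0, 3_000, 30_000));  // 10000
        System.out.println(delayMs(10_000, 5.0, 3_000, 30_000)); // 30000
    }
}
```

Scaling the delay by the observed fetch time means a slow, possibly overloaded server is automatically given more breathing room, while the minimum keeps fast servers from being hit too frequently.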

Documentation

Developer Documentation

Developer Manual
REST API documentation
JavaDoc: engine, modules, commons, contrib

Latest Releases

Information about releases can be found on the project's GitHub releases page.

License

Heritrix is free software; you can redistribute it and/or modify it under the terms of the Apache License, Version 2.0.

Some individual source code files are subject to or offered under other licenses. See the included LICENSE.txt file for more information.

Heritrix is distributed with the libraries it depends upon. The libraries can be found under the lib directory in the release distribution, and are used under the terms of their respective licenses, which are included alongside the libraries in the lib directory.
