heritrix3
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Top Related Projects
Core Python Web Archiving Toolkit for replay and recording of web archives
Quick Overview
Heritrix3 is an open-source, extensible web crawler developed by the Internet Archive. It is designed to crawl and archive web content for preservation and research, and it lets users build customized crawling workflows.
Pros
- Extensible Architecture: Heritrix3 has a modular design, allowing users to easily extend its functionality by developing custom modules and plugins.
- Scalable and Efficient: The crawler is capable of handling large-scale web crawling tasks, with features like distributed crawling and resource management.
- Customizable Crawling Workflows: Users can configure various aspects of the crawling process, such as scope, politeness, and content selection, to suit their specific needs.
- Robust Error Handling: Heritrix3 provides comprehensive error handling and reporting, making it easier to identify and address issues during the crawling process.
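To make the politeness setting mentioned above more concrete, here is a minimal Python sketch of one common approach: wait a multiple of the last fetch time before contacting the same host again, clamped between a minimum and a maximum. The function name, parameter names, and default values are assumptions chosen for illustration, not values read from Heritrix itself; check the crawler's own configuration for the real settings.

```python
def politeness_delay_ms(fetch_duration_ms, delay_factor=5.0,
                        min_delay_ms=3000, max_delay_ms=30000):
    """Return how long to wait before contacting the same host again.

    All parameter names and defaults are illustrative assumptions.
    """
    # Slow responses suggest a loaded server, so wait proportionally longer.
    delay = fetch_duration_ms * delay_factor
    # Clamp so that fast servers still get a pause and slow ones
    # do not stall the crawl indefinitely.
    return int(min(max(delay, min_delay_ms), max_delay_ms))


if __name__ == "__main__":
    for duration in (100, 1000, 10000):
        print(duration, "ms fetch ->", politeness_delay_ms(duration), "ms wait")
```

Scaling the delay with the observed fetch time lets the crawl back off automatically from servers that are struggling, while the clamp keeps the behavior predictable.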
Cons
- Steep Learning Curve: Configuring and using Heritrix3 can be complex, especially for users new to web crawling and archiving.
- Limited Documentation: The project's documentation, while available, could be more comprehensive and user-friendly.
- Outdated User Interface: The web-based user interface of Heritrix3 is somewhat dated and may not meet the expectations of modern web applications.
- Dependency on Java: Heritrix3 is written in Java, which may be a limitation for users who prefer other programming languages.
Getting Started
To get started with Heritrix3, follow these steps:
- Make sure a Java runtime is installed, since Heritrix3 runs on the JVM.
- Download the latest release of Heritrix3 from the GitHub repository.
- Extract the downloaded archive and navigate to the extracted heritrix3 directory.
- Run the heritrix script in the bin directory (or heritrix.bat on Windows) to start the Heritrix3 web interface. The script expects operator credentials through the -a option, for example -a admin:admin.
- Open a web browser and navigate to https://localhost:8443/engine to access the Heritrix3 web interface. The interface is served over HTTPS with a self-signed certificate, so expect a browser certificate warning on first use.
- Create a new job by clicking on the "Jobs" tab and then the "New Job" button.
- Configure the job settings, such as the crawl scope, politeness, and content selection, according to your requirements.
- Start the crawl by clicking the "Start" button.
Heritrix3 provides extensive configuration options and features, so it's recommended to review the project's documentation to learn more about advanced usage and customization.
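The same job lifecycle can also be driven from a script rather than the browser. The sketch below uses only the Python standard library and assumes the engine listens at https://localhost:8443/engine, uses HTTP Digest authentication, and accepts a form-encoded action parameter posted to a per-job URL. The job name myjob, the admin:admin credentials, and the action names are placeholders and assumptions; confirm them against the REST API documentation before relying on them.

```python
import ssl
import urllib.parse
import urllib.request

ENGINE_URL = "https://localhost:8443/engine"  # assumed default address


def build_job_action(job_name, action, engine_url=ENGINE_URL):
    """Build (but do not send) a POST request applying an action to a job."""
    url = f"{engine_url}/job/{urllib.parse.quote(job_name)}"
    body = urllib.parse.urlencode({"action": action}).encode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Accept", "application/xml")
    return request


def make_opener(username, password, engine_url=ENGINE_URL):
    """Create an opener with Digest auth that tolerates a self-signed cert."""
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    passwords.add_password(None, engine_url, username, password)
    context = ssl.create_default_context()
    # Only acceptable for a local instance with a self-signed certificate.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=context),
        urllib.request.HTTPDigestAuthHandler(passwords),
    )


def start_crawl(opener, job_name):
    """Send the assumed build, launch, unpause sequence for one job."""
    for action in ("build", "launch", "unpause"):
        with opener.open(build_job_action(job_name, action)) as response:
            print(action, response.status)
```

With a local instance running, calling start_crawl(make_opener("admin", "admin"), "myjob") would send the three requests in order. Keeping request construction separate from sending makes it easy to inspect what will be posted before touching a live crawler.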
Competitor Comparisons
Core Python Web Archiving Toolkit for replay and recording of web archives
Pros of pywb
- pywb is a more lightweight and flexible web archiving solution compared to Heritrix3, making it easier to integrate into custom applications.
- pywb works directly with the standard web archive formats, reading WARC and ARC files and using CDX indexes to look up captures, which gives it flexibility in handling already-archived data.
- pywb has a more modern and user-friendly web interface, making it easier for users to interact with and manage their web archives.
Cons of pywb
- Heritrix3 is a more mature and feature-rich web crawler, with a larger user community and more extensive documentation.
- Heritrix3 is better suited for large-scale web crawling and archiving projects, with more advanced scheduling and configuration options.
- Heritrix3 has a more robust and reliable crawling engine, which may be important for mission-critical web archiving applications.
Code Comparison
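Heritrix3 (Java):
    import java.util.logging.Level;
    import java.util.logging.Logger;

    // Simplified illustration of a crawl lifecycle, not the actual Heritrix source.
    public class CrawlController implements Runnable {
        private static final Logger logger =
                Logger.getLogger(CrawlController.class.getName());

        @Override
        public void run() {
            try {
                initializeResources();
                runCrawl();
            } catch (Exception e) {
                logger.log(Level.SEVERE, "Crawl failed", e);
            } finally {
                cleanupResources(); // always release resources, even on failure
            }
        }

        private void initializeResources() { /* open queues and writers */ }
        private void runCrawl() { /* fetch and process URIs */ }
        private void cleanupResources() { /* close files and connections */ }
    }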
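pywb (Python):
    import logging

    class ReplayApp:
        # Simplified illustration of a WSGI wrapper, not the actual pywb source.
        def __init__(self, application):
            self.application = application  # the wrapped WSGI callable
            self.logger = logging.getLogger(__name__)

        def replay_request(self, env, start_response):
            try:
                return self.application(env, start_response)
            except Exception as e:
                self.logger.error('Error replaying request: %s', e)
                start_response('500 Internal Server Error', [('Content-Type', 'text/plain')])
                return [b'Internal Server Error']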
README
Heritrix
Introduction
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.
Crawl Operators!
Heritrix is designed to respect the robots.txt exclusion directives† and META nofollow tags. Please consider the load your crawl will place on seed sites and set politeness policies accordingly. Also, always identify your crawl with contact information in the User-Agent so sites that may be adversely affected by your crawl can contact you or adapt their server behavior accordingly.
† The newer wildcard extension to robots.txt is not yet supported.
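As a rough illustration of what respecting robots.txt and identifying a crawl look like in practice, the following Python sketch uses the standard library's urllib.robotparser to check a URL against a set of exclusion rules. The user-agent string, the contact URL, and the sample rules are hypothetical, and this is not how Heritrix itself implements the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity; the URL should lead to real contact details.
USER_AGENT = ("Mozilla/5.0 (compatible; examplebot/1.0 "
              "+https://example.org/crawl-contact)")

# Sample rules, standing in for a site's real robots.txt.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/index.html",
                "https://example.com/private/data"):
        print(url, "allowed" if is_allowed(ROBOTS, url) else "blocked")
```

Putting a reachable contact address in the User-Agent gives site operators a way to report problems instead of simply blocking the crawler.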
Documentation
Developer Documentation
- Developer Manual
- REST API documentation
- JavaDoc: engine, modules, commons, contrib
Latest Releases
Information about releases can be found here.
License
Heritrix is free software; you can redistribute it and/or modify it under the terms of the Apache License, Version 2.0.
Some individual source code files are subject to or offered under other licenses. See the included LICENSE.txt file for more information.
Heritrix is distributed with the libraries it depends upon. The libraries can be found under the lib directory in the release distribution, and are used under the terms of their respective licenses, which are included alongside the libraries in the lib directory.