Top Related Projects
List of libraries, tools and APIs for web scraping and data processing.
A collection of crawlers
a curated list of awesome streaming frameworks, applications, etc
The Big List of Naughty Strings is a list of strings which have a high probability of causing issues when used as user-input data.
Python best practices guidebook, written for humans.
Quick Overview
Awesome-crawler is a curated list of web crawling resources, tools, and libraries. It provides a comprehensive collection of open-source crawlers, frameworks, and related technologies for various programming languages. This repository serves as a valuable reference for developers interested in web scraping and data extraction.
Pros
- Extensive collection of crawling resources across multiple programming languages
- Well-organized and categorized list, making it easy to find relevant tools
- Regularly updated with new additions and community contributions
- Includes both popular and lesser-known crawling tools, providing a wide range of options
Cons
- Lacks detailed descriptions or comparisons of the listed tools
- Some links may become outdated over time if not regularly maintained
- Does not provide guidance on best practices or ethical considerations in web crawling
- May overwhelm beginners due to the large number of options presented
Code Examples
This repository is not a code library but a curated list of resources. Therefore, there are no code examples to provide.
Getting Started
As this is not a code library, there are no specific getting started instructions. However, users can explore the repository by visiting the GitHub page and browsing through the categorized list of crawling resources. To contribute to the project, users can follow these steps:
- Fork the repository
- Add new resources or update existing ones
- Submit a pull request with a clear description of the changes
For more detailed information, refer to the repository's README and contribution guidelines.
Competitor Comparisons
List of libraries, tools and APIs for web scraping and data processing.
Pros of awesome-web-scraping
- More comprehensive and organized categorization of tools and resources
- Includes a wider range of programming languages and frameworks
- Regularly updated with new contributions and resources
Cons of awesome-web-scraping
- Less focus on specific crawler implementations
- Fewer examples of complete crawler projects
- May be overwhelming for beginners due to the extensive list of resources
Code Comparison
Both repositories are curated lists and do not typically include code snippets of their own, although they link to projects that do. Here is a hypothetical comparison of the kind of simple web scraping example each list's linked tools might present:
awesome-crawler:
import requests
from bs4 import BeautifulSoup
url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
awesome-web-scraping:
import scrapy
class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']
    def parse(self, response):
        # Extract data here
        pass
Both repositories are useful references for web scraping and crawling: awesome-web-scraping offers a broader scope, while awesome-crawler keeps a tighter focus on crawler implementations.
A collection of crawlers
Pros of awesome-spider
- More focused on Chinese resources and tools
- Includes a section on anti-crawler techniques
- Provides links to relevant books and tutorials
Cons of awesome-spider
- Less frequently updated compared to awesome-crawler
- Fewer international resources and tools
- Smaller overall collection of links and resources
Code Comparison
Both repositories are curated lists of resources rather than code repositories, so neither ships code of its own. The snippets below are illustrative of the kind of tools each list links to:
awesome-crawler:
import scrapy
class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']
    def parse(self, response):
        # Scraping logic here
        pass
awesome-spider:
import requests
from bs4 import BeautifulSoup
url = 'http://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Parsing logic here
Both snippets show basic web scraping: the awesome-crawler example uses the Scrapy framework, while the awesome-spider example takes a more general approach with requests and BeautifulSoup.
a curated list of awesome streaming frameworks, applications, etc
Pros of awesome-streaming
- Focuses specifically on streaming data processing, providing a more specialized resource
- Includes a wider range of categories, such as streaming databases, messaging systems, and benchmarking tools
- Offers a more comprehensive list of streaming-related frameworks and libraries
Cons of awesome-streaming
- Less frequently updated compared to awesome-crawler
- Has fewer contributors and stars on GitHub
- Lacks detailed descriptions for some of the listed resources
Code Comparison
While both repositories are curated lists and don't contain significant code, here's a comparison of their README structures:
awesome-streaming:
## Streaming Processing
### Distributed Streaming Platforms
- [Apache Flink](https://flink.apache.org/) - Distributed stream and batch data processing framework.
- [Apache Spark Streaming](https://spark.apache.org/streaming/) - Micro-batch processing for Spark.
awesome-crawler:
## Python
* [scrapy](https://github.com/scrapy/scrapy) - A fast high-level screen scraping and web crawling framework.
* [pyspider](https://github.com/binux/pyspider) - A powerful spider system.
Both repositories use similar markdown structures, but awesome-streaming organizes content into more specific categories, while awesome-crawler groups resources primarily by programming language.
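Because both READMEs follow the same "heading, then bulleted links" layout, their entries can be read programmatically. The following is a minimal sketch, assuming entries look like the samples above (`* [name](url) - description` under a `#` heading); the function name `parse_awesome_list` and the regular expressions are illustrative assumptions, not part of either repository.

```python
import re
from collections import defaultdict

# Assumed entry shape: "* [name](url) - description" or "- [name](url) - description".
ENTRY_RE = re.compile(
    r"^\s*[*-]\s+\[(?P<name>[^\]]+)\]\((?P<url>[^)]+)\)\s*-\s*(?P<desc>.*)$"
)
# Assumed heading shape: one to six '#' characters followed by a title.
HEADING_RE = re.compile(r"^#{1,6}\s+(?P<title>.+?)\s*$")


def parse_awesome_list(markdown):
    """Group link entries under the most recent heading above them."""
    sections = defaultdict(list)
    current = "(no heading)"
    for line in markdown.splitlines():
        heading = HEADING_RE.match(line)
        if heading:
            current = heading.group("title")
            continue
        entry = ENTRY_RE.match(line)
        if entry:
            sections[current].append(entry.groupdict())
    return dict(sections)


# Sample taken from the awesome-crawler excerpt above.
SAMPLE = """\
## Python
* [scrapy](https://github.com/scrapy/scrapy) - A fast high-level screen scraping and web crawling framework.
* [pyspider](https://github.com/binux/pyspider) - A powerful spider system.
"""

parsed = parse_awesome_list(SAMPLE)
for section, entries in parsed.items():
    print(section, [entry["name"] for entry in entries])
```

Grouping by the nearest heading works for both layouts: language headings in awesome-crawler and topic headings in awesome-streaming.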
The Big List of Naughty Strings is a list of strings which have a high probability of causing issues when used as user-input data.
Pros of big-list-of-naughty-strings
- Focused on a specific use case (input validation testing)
- Regularly updated with new edge cases and potential vulnerabilities
- Easily integrable into various testing frameworks and languages
Cons of big-list-of-naughty-strings
- Limited scope compared to the broader crawler resources in awesome-crawler
- Less versatile for general web scraping and data extraction tasks
- Requires additional tools or frameworks for practical implementation
Code Comparison
big-list-of-naughty-strings:
with open('blns.txt') as f:
    strings = [line.strip() for line in f]
awesome-crawler (example from a linked resource):
import scrapy
class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']
    def parse(self, response):
        yield {'title': response.css('h1::text').get()}
The big-list-of-naughty-strings code snippet demonstrates loading the list of problematic strings, while the awesome-crawler example shows a basic web scraping setup using Scrapy. This highlights the different focus areas of the two repositories: input testing vs. web crawling.
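A slightly fuller loading routine might look like the sketch below. It assumes the `blns.txt` layout in which lines starting with `#` are comments and blank lines separate groups; the helper name `load_naughty_strings` and the in-memory sample are hypothetical and stand in for the real file.

```python
import io


def load_naughty_strings(lines):
    """Return test strings from an iterable of blns.txt-style lines.

    Assumption: lines starting with '#' are comments and empty lines are
    separators. Only the trailing newline is removed, so strings that
    consist of whitespace are kept as they are.
    """
    strings = []
    for line in lines:
        text = line.rstrip("\n")
        if not text or text.startswith("#"):
            continue
        strings.append(text)
    return strings


# Hypothetical stand-in for open('blns.txt', encoding='utf-8').
sample = io.StringIO(
    "#\tReserved Strings\n\nundefined\nnull\n#\tNumeric Strings\n\n0\n1.00\n"
)
naughty = load_naughty_strings(sample)
print(naughty)
```

Removing only the newline, rather than calling `strip()`, is a deliberate choice here: stripping all whitespace would alter test strings whose whole point is unusual whitespace.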
Python best practices guidebook, written for humans.
Pros of python-guide
- Comprehensive guide covering various Python topics beyond web scraping
- Well-organized structure with clear sections for different aspects of Python development
- Regularly updated with contributions from the Python community
Cons of python-guide
- Less focused on specific web scraping techniques and tools
- May not provide as many direct links to crawler-related libraries and frameworks
Code Comparison
python-guide example (configuration):
import configparser
config = configparser.ConfigParser()
config['DEFAULT'] = {'ServerAliveInterval': '45',
                     'Compression': 'yes',
                     'CompressionLevel': '9'}
awesome-crawler example (web scraping):
import requests
from bs4 import BeautifulSoup
url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
Summary
python-guide is a comprehensive resource for Python developers, covering a wide range of topics and best practices. It's well-maintained and regularly updated, making it valuable for both beginners and experienced developers.
awesome-crawler, on the other hand, is more focused on web scraping and crawling techniques. It provides a curated list of tools, libraries, and resources specifically for web scraping tasks.
While python-guide offers a broader perspective on Python development, awesome-crawler is more suitable for those looking for specialized information on web crawling and scraping techniques.
README
Awesome-crawler
A collection of awesome web crawlers, spiders, and resources in different languages.
Contents
Python
- Scrapy - A fast high-level screen scraping and web crawling framework.
- django-dynamic-scraper - Creating Scrapy scrapers via the Django admin interface.
- Scrapy-Redis - Redis-based components for Scrapy.
- scrapy-cluster - Uses Redis and Kafka to create a distributed on demand scraping cluster.
- distribute_crawler - Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider.
- pyspider - A powerful spider system.
- CoCrawler - A versatile web crawler built using modern tools and concurrency.
- cola - A distributed crawling framework.
- Demiurge - PyQuery-based scraping micro-framework.
- Scrapely - A pure-python HTML screen-scraping library.
- feedparser - Universal feed parser.
- you-get - Dumb downloader that scrapes the web.
- MechanicalSoup - A Python library for automating interaction with websites.
- portia - Visual scraping for Scrapy.
- crawley - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
- RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
- MSpider - A simple, easy spider using gevent and JS rendering.
- brownant - A lightweight web data extracting framework.
- PSpider - A simple spider frame in Python3.
- Gain - Web crawling framework based on asyncio for everyone.
- sukhoi - Minimalist and powerful Web Crawler.
- spidy - The simple, easy to use command line web crawler.
- newspaper - News, full-text, and article metadata extraction in Python 3
- aspider - An async web scraping micro-framework based on asyncio.
Java
- ACHE Crawler - An easy to use web crawler for domain-specific search.
- Apache Nutch - Highly extensible, highly scalable web crawler for production environment.
- anthelion - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
- Crawler4j - Simple and lightweight web crawler.
- JSoup - Scrapes, parses, manipulates and cleans HTML.
- websphinx - Website-Specific Processors for HTML information extraction.
- Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
- Gecco - An easy-to-use lightweight web crawler.
- WebCollector - Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
- Webmagic - A scalable crawler framework.
- Spiderman - A scalable, extensible, multi-threaded web crawler.
- Spiderman2 - A distributed web crawler framework that supports JS rendering.
- Heritrix3 - Extensible, web-scale, archival-quality web crawler project.
- SeimiCrawler - An agile, distributed crawler framework.
- StormCrawler - An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm
- Spark-Crawler - Evolving Apache Nutch to run on Spark.
- webBee - A DFS web spider.
- spider-flow - A visual spider framework; it's so good that you don't need to write any code to crawl a website.
- Norconex Web Crawler - Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repository of your choice (e.g. a search engine). Can be used as a stand alone application or be embedded into Java applications.
C#
- ccrawler - Built with C# 3.5. It contains a simple web content categorizer extension that can separate web pages according to their content.
- SimpleCrawler - A simple spider based on multithreading and regular expressions.
- DotnetSpider - A cross-platform, lightweight spider developed in C#.
- Abot - C# web crawler built for speed and flexibility.
- Hawk - Advanced Crawler and ETL tool written in C#/WPF.
- SkyScraper - An asynchronous web scraper / web crawler using async / await and Reactive Extensions.
- Infinity Crawler - A simple but powerful web crawler library in C#.
JavaScript
- scraperjs - A complete and versatile web scraper.
- scrape-it - A Node.js scraper for humans.
- simplecrawler - Event driven web crawler.
- node-crawler - Node-crawler has a clean, simple API.
- js-crawler - Web crawler for Node.JS, both HTTP and HTTPS are supported.
- webster - A reliable web crawling framework which can scrape ajax and js rendered content in a web page.
- x-ray - Web scraper with pagination and crawler support.
- node-osmosis - HTML/XML parser and web scraper for Node.js.
- web-scraper-chrome-extension - Web data extraction tool implemented as chrome extension.
- supercrawler - Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
- headless-chrome-crawler - Headless Chrome crawls with jQuery support
- Squidwarc - High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
- crawlee - A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.
PHP
- Goutte - A screen scraping and web crawling library for PHP.
- laravel-goutte - Laravel 5 Facade for Goutte.
- dom-crawler - The DomCrawler component eases DOM navigation for HTML and XML documents.
- QueryList - The progressive PHP crawler framework.
- pspider - Parallel web crawler written in PHP.
- php-spider - A configurable and extensible PHP web spider.
- spatie/crawler - An easy to use, powerful crawler implemented in PHP. Can execute Javascript.
- crawlzone/crawlzone - Crawlzone is a fast asynchronous internet crawling framework for PHP.
- PHPScraper - PHPScraper is a scraper & crawler built for simplicity.
C++
- open-source-search-engine - A distributed open source search engine and spider/crawler written in C/C++.
C
- httrack - Copy websites to your computer.
Ruby
- Nokogiri - A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
- upton - A batteries-included framework for easy web scraping. Just add CSS (or do more).
- wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
- RubyRetriever - RubyRetriever is a Web Crawler, Scraper & File Harvester.
- Spidr - Spider a site, multiple domains, certain links or infinitely.
- Cobweb - Web crawler with very flexible crawling options, standalone or using sidekiq.
- mechanize - Automated web interaction & crawling.
Rust
- spider - The fastest web crawler and indexer.
- crawler - A gRPC web indexer turbo charged for performance.
R
- rvest - Simple web scraping for R.
Erlang
- ebot - A scalable, distributed, and highly configurable web crawler.
Perl
- web-scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.
Go
- pholcus - A distributed, high concurrency and powerful web crawler.
- gocrawl - Polite, slim and concurrent web crawler.
- fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
- go_spider - An awesome Go concurrent Crawler(spider) framework.
- dht - BitTorrent DHT Protocol && DHT Spider.
- ants-go - An open source, distributed, RESTful crawler engine in Go.
- scrape - A simple, higher level interface for Go web scraping.
- creeper - The Next Generation Crawler Framework (Go).
- colly - Fast and Elegant Scraping Framework for Gophers.
- ferret - Declarative web scraping.
- Dataflow kit - Extract structured data from web pages. Website scraping.
- Hakrawler - Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
Scala