
binux/pyspider

A Powerful Spider(Web Crawler) System in Python.

Top Related Projects

  • Scrapy (55,024 stars): a fast high-level web crawling & scraping framework for Python
  • ddgr (3,092 stars): DuckDuckGo from the terminal
  • Grab (2,404 stars): Web Scraping Framework
  • Ferret (5,825 stars): Declarative web scraping
  • fake-useragent: Up-to-date simple useragent faker with real world database

Quick Overview

PySpider is a powerful and extensible web crawling and scraping framework written in Python. It provides a user-friendly web interface for managing crawling tasks, supports multiple backends for task queue and result storage, and offers a scripting environment for writing custom spiders.

Pros

  • User-friendly web interface for managing and monitoring crawling tasks
  • Supports multiple backends for distributed task queue and result storage
  • Extensible architecture allowing for custom processors and result handlers (see code example 4 below)
  • Built-in JavaScript rendering support using PhantomJS

Cons

  • Limited documentation and examples for advanced use cases
  • Some dependencies may be outdated or no longer maintained
  • Learning curve for users new to web scraping or Python programming
  • Performance may be slower compared to more lightweight scraping libraries

Code Examples

  1. Basic spider example:
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
        "headers": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        }
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

This example demonstrates a basic spider that crawls a website, extracts links, and saves the title of each page.

  2. Using PhantomJS for JavaScript rendering:
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @config(js_script="""
    function() {
        window.scrollTo(0, document.body.scrollHeight);
        return 'scrolled';
    }
    """)
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page, fetch_type='js')

    def index_page(self, response):
        # Process the rendered page
        pass

This example shows how to use PhantomJS to render JavaScript content before processing the page.

  3. Handling pagination:
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/page/1', callback=self.index_page)

    def index_page(self, response):
        for item in response.doc('.item').items():
            self.crawl(item.attr.href, callback=self.detail_page)

        next_page = response.doc('.next-page').attr.href
        if next_page:
            self.crawl(next_page, callback=self.index_page)

    def detail_page(self, response):
        # Process the detail page
        pass

This example demonstrates how to handle pagination by following the "next page" link.
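
  4. Custom result handling:

BaseHandler also exposes an on_result hook that is called for every result returned by a callback, so a handler can route results to its own sink in addition to the default result backend. The following is a minimal sketch; the print call stands in for whatever storage you actually use:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/', callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

    def on_result(self, result):
        # Called for each result returned by a callback.
        if result:
            print(result)  # stand-in for a custom sink (database, queue, file, ...)
        # Keep the default behavior of writing to the configured resultdb.
        super(Handler, self).on_result(result)

This example shows how overriding on_result allows custom handling of results without changing the crawling logic.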

Getting Started

  1. Install PySpider:

    pip install pyspider
    
  2. Start the PySpider server:

    pyspider
    
  3. Open your web browser and navigate to http://localhost:5000 to access the PySpider web interface.

  4. Create a new project and paste one of the example spiders into the script editor.

  5. Press the "Run" button to start the crawling task.

Competitor Comparisons

Scrapy (55,024 stars): a fast high-level web crawling & scraping framework for Python

Pros of Scrapy

  • More mature and widely adopted project with a larger community
  • Extensive documentation and ecosystem of extensions
  • Better suited for large-scale, distributed crawling projects

Cons of Scrapy

  • Steeper learning curve, especially for beginners
  • More complex setup and configuration required
  • Less intuitive for simple scraping tasks

Code Comparison

Scrapy example:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'title': response.css('h1::text').get()}

PySpider example:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @every(minutes=24*60)
    def on_start(self):
        self.crawl('http://example.com', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        return {'title': response.doc('h1').text()}

Both Scrapy and PySpider are powerful web scraping frameworks, but they cater to different needs. Scrapy is more suitable for large-scale projects and offers greater flexibility, while PySpider provides a more user-friendly interface and is easier to set up for simpler scraping tasks. The choice between the two depends on the project requirements, scale, and the developer's experience level.

ddgr (3,092 stars): DuckDuckGo from the terminal

Pros of ddgr

  • Lightweight command-line tool focused on DuckDuckGo searches
  • Simple to use with minimal dependencies
  • Faster execution for specific search tasks

Cons of ddgr

  • Limited in scope compared to pyspider's full web crawling capabilities
  • Lacks a web interface and distributed architecture
  • Not suitable for complex, large-scale web scraping projects

Code Comparison

ddgr (Python):

import requests

def fetch(url, headers=None):
    # Return the response body, or None if the request fails.
    try:
        res = requests.get(url, headers=headers, timeout=10)
        return res.text
    except Exception as e:
        print(e)
        return None

pyspider (Python):

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {"url": response.url, "title": response.doc('title').text()}

While ddgr focuses on fetching search results from DuckDuckGo, pyspider offers a more comprehensive framework for web crawling with features like scheduling, distributed processing, and a web interface. ddgr is better suited for quick command-line searches, while pyspider excels in complex web scraping tasks.

Grab (2,404 stars): Web Scraping Framework

Pros of Grab

  • More lightweight and focused on web scraping tasks
  • Simpler API and easier to get started for basic scraping needs
  • Better support for asynchronous operations using asyncio

Cons of Grab

  • Less comprehensive feature set compared to PySpider
  • Smaller community and fewer resources available
  • Lacks a built-in web interface for managing scraping tasks

Code Comparison

PySpider example:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

Grab example:

from grab import Grab

g = Grab()
g.go('http://example.com/')
for link in g.doc.select('//a[@href]'):
    g.go(link.attr('href'))
    # Process the page

Both libraries offer straightforward ways to crawl websites, but PySpider provides a more structured approach with its handler-based system, while Grab offers a more direct and flexible method for simple scraping tasks.

Ferret (5,825 stars): Declarative web scraping

Pros of Ferret

  • Written in Go, offering better performance and concurrency handling
  • Supports declarative scraping with FQL (Ferret Query Language)
  • Provides a more modern and actively maintained codebase

Cons of Ferret

  • Smaller community and ecosystem compared to PySpider
  • Steeper learning curve for those unfamiliar with Go or FQL
  • Less extensive documentation and tutorials available

Code Comparison

PySpider example:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

Ferret example:

LET doc = DOCUMENT("https://example.com/")
LET links = ELEMENTS(doc, "a")

FOR link IN links
    FILTER STARTS_WITH(link.href, "http")
    LET page = DOCUMENT(link.href)
    // Process the page

Both examples demonstrate basic crawling functionality, but Ferret uses a more declarative approach with FQL, while PySpider relies on Python classes and methods for defining the crawling behavior.

fake-useragent: Up-to-date simple useragent faker with real world database

Pros of fake-useragent

  • Lightweight and focused on a single task: generating random user agents
  • Easy to integrate into existing projects without additional dependencies
  • Regularly updated with new user agent strings

Cons of fake-useragent

  • Limited functionality compared to pyspider's full-featured web crawling capabilities
  • Lacks built-in scraping and data processing features
  • May require additional libraries for more complex web scraping tasks

Code Comparison

fake-useragent:

from fake_useragent import UserAgent
ua = UserAgent()
random_user_agent = ua.random

pyspider:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    def index_page(self, response):
        return {
            "title": response.doc('title').text(),
        }

Summary

fake-useragent is a specialized library for generating random user agents, while pyspider is a comprehensive web crawling framework. fake-useragent is easier to integrate for simple user agent rotation, but pyspider offers more advanced features for complex web scraping tasks. The choice between the two depends on the specific requirements of your project.
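
The two are not mutually exclusive: since self.crawl accepts per-request headers, a pyspider handler can use fake-useragent to rotate the User-Agent on each request. A minimal sketch (the URL and result fields are illustrative):

from fake_useragent import UserAgent
from pyspider.libs.base_handler import *

ua = UserAgent()

class Handler(BaseHandler):
    def on_start(self):
        # Pick a fresh random User-Agent for this request.
        self.crawl('http://example.com/', callback=self.index_page,
                   headers={'User-Agent': ua.random})

    def index_page(self, response):
        return {'title': response.doc('title').text()}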

README

pyspider

A Powerful Spider(Web Crawler) System in Python.

  • Write script in Python
  • Powerful WebUI with script editor, task monitor, project manager and result viewer
  • MySQL, MongoDB, Redis, SQLite, Elasticsearch; PostgreSQL with SQLAlchemy as database backend
  • RabbitMQ, Redis and Kombu as message queue
  • Task priority, retry, periodical, recrawl by age, etc...
  • Distributed architecture, Crawl Javascript pages, Python 2.{6,7}, 3.{3,4,5,6} support, etc...

Tutorial: http://docs.pyspider.org/en/latest/tutorial/
Documentation: http://docs.pyspider.org/
Release notes: https://github.com/binux/pyspider/releases

Sample Code

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
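
The feature list above mentions task priority, retries, and recrawl by age; these are exposed as keyword arguments to self.crawl. A minimal sketch extending the sample (the values are illustrative, not recommendations):

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/',
                   callback=self.index_page,
                   age=10 * 24 * 60 * 60,  # treat pages fetched within 10 days as fresh
                   priority=2,             # schedule ahead of lower-priority tasks
                   retries=3,              # retry a failed fetch up to 3 times
                   auto_recrawl=True)      # re-crawl automatically every `age` seconds

    def index_page(self, response):
        return {'title': response.doc('title').text()}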

Installation

WARNING: The WebUI is open to the public by default and can be used to execute arbitrary commands, which may harm your system. Please run it only on an internal network or enable need-auth for the WebUI.

Quickstart: http://docs.pyspider.org/en/latest/Quickstart/

Contribute

TODO

v0.4.0

  • a visual scraping interface like portia

License

Licensed under the Apache License, Version 2.0