
binux/pyspider

A Powerful Spider(Web Crawler) System in Python.

Top Related Projects

  • Scrapy (55,024 stars): a fast high-level web crawling & scraping framework for Python
  • ddgr (3,092 stars): DuckDuckGo from the terminal
  • Grab (2,404 stars): Web Scraping Framework
  • Ferret (5,825 stars): Declarative web scraping
  • fake-useragent: Up-to-date simple useragent faker with real world database

Quick Overview

PySpider is a powerful and extensible web crawling and scraping framework written in Python. It provides a user-friendly web interface for managing crawling tasks, supports multiple backends for task queue and result storage, and offers a scripting environment for writing custom spiders.

Pros

  • User-friendly web interface for managing and monitoring crawling tasks
  • Supports multiple backends for distributed task queue and result storage
  • Extensible architecture allowing for custom processors and result handlers (see code example 4 below)
  • Built-in JavaScript rendering support using PhantomJS

Cons

  • Limited documentation and examples for advanced use cases
  • Some dependencies may be outdated or no longer maintained
  • Learning curve for users new to web scraping or Python programming
  • Performance may be slower compared to more lightweight scraping libraries

Code Examples

  1. Basic spider example:
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
        "headers": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        }
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

This example demonstrates a basic spider that crawls a website, extracts links, and saves the title of each page.

  2. Using PhantomJS for JavaScript rendering:
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @config(js_script="""
    function() {
        window.scrollTo(0, document.body.scrollHeight);
        return 'scrolled';
    }
    """)
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page, fetch_type='js')

    def index_page(self, response):
        # Process the rendered page
        pass

This example shows how to use PhantomJS to render JavaScript content before processing the page.

  3. Handling pagination:
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/page/1', callback=self.index_page)

    def index_page(self, response):
        for item in response.doc('.item').items():
            self.crawl(item.attr.href, callback=self.detail_page)

        next_page = response.doc('.next-page').attr.href
        if next_page:
            self.crawl(next_page, callback=self.index_page)

    def detail_page(self, response):
        # Process the detail page
        pass

This example demonstrates how to handle pagination by following the "next page" link.
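
  4. Custom result handling:

BaseHandler also exposes an on_result hook that is called for every result returned by a callback, so a handler can route results to its own sink in addition to the default result backend. The following is a minimal sketch; the print call stands in for whatever storage you actually use:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/', callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

    def on_result(self, result):
        # Called for each result returned by a callback.
        if result:
            print(result)  # stand-in for a custom sink (database, queue, file, ...)
        # Keep the default behavior of writing to the configured resultdb.
        super(Handler, self).on_result(result)

This example shows how overriding on_result allows custom handling of results without changing the crawling logic.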

Getting Started

  1. Install PySpider:

    pip install pyspider
    
  2. Start the PySpider server:

    pyspider
    
  3. Open your web browser and navigate to http://localhost:5000 to access the PySpider web interface.

  4. Create a new project and paste one of the example spiders into the script editor.

  5. Press the "Run" button to start the crawling task.

Competitor Comparisons

Scrapy (55,024 stars): a fast high-level web crawling & scraping framework for Python

Pros of Scrapy

  • More mature and widely adopted project with a larger community
  • Extensive documentation and ecosystem of extensions
  • Better suited for large-scale, distributed crawling projects

Cons of Scrapy

  • Steeper learning curve, especially for beginners
  • More complex setup and configuration required
  • Less intuitive for simple scraping tasks

Code Comparison

Scrapy example:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'title': response.css('h1::text').get()}

PySpider example:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @every(minutes=24*60)
    def on_start(self):
        self.crawl('http://example.com', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        return {'title': response.doc('h1').text()}

Both Scrapy and PySpider are powerful web scraping frameworks, but they cater to different needs. Scrapy is more suitable for large-scale projects and offers greater flexibility, while PySpider provides a more user-friendly interface and is easier to set up for simpler scraping tasks. The choice between the two depends on the project requirements, scale, and the developer's experience level.

ddgr (3,092 stars): DuckDuckGo from the terminal

Pros of ddgr

  • Lightweight command-line tool focused on DuckDuckGo searches
  • Simple to use with minimal dependencies
  • Faster execution for specific search tasks

Cons of ddgr

  • Limited in scope compared to pyspider's full web crawling capabilities
  • Lacks a web interface and distributed architecture
  • Not suitable for complex, large-scale web scraping projects

Code Comparison

ddgr (Python):

import requests

def fetch(url, headers=None):
    # Return the response body, or None if the request fails.
    try:
        res = requests.get(url, headers=headers, timeout=10)
        return res.text
    except Exception as e:
        print(e)
        return None

pyspider (Python):

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {"url": response.url, "title": response.doc('title').text()}

While ddgr focuses on fetching search results from DuckDuckGo, pyspider offers a more comprehensive framework for web crawling with features like scheduling, distributed processing, and a web interface. ddgr is better suited for quick command-line searches, while pyspider excels in complex web scraping tasks.

Grab (2,404 stars): Web Scraping Framework

Pros of Grab

  • More lightweight and focused on web scraping tasks
  • Simpler API and easier to get started for basic scraping needs
  • Better support for asynchronous operations using asyncio

Cons of Grab

  • Less comprehensive feature set compared to PySpider
  • Smaller community and fewer resources available
  • Lacks a built-in web interface for managing scraping tasks

Code Comparison

PySpider example:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

Grab example:

from grab import Grab

g = Grab()
g.go('http://example.com/')
for link in g.doc.select('//a[@href]'):
    g.go(link.attr('href'))
    # Process the page

Both libraries offer straightforward ways to crawl websites, but PySpider provides a more structured approach with its handler-based system, while Grab offers a more direct and flexible method for simple scraping tasks.

Ferret (5,825 stars): Declarative web scraping

Pros of Ferret

  • Written in Go, offering better performance and concurrency handling
  • Supports declarative scraping with FQL (Ferret Query Language)
  • Provides a more modern and actively maintained codebase

Cons of Ferret

  • Smaller community and ecosystem compared to PySpider
  • Steeper learning curve for those unfamiliar with Go or FQL
  • Less extensive documentation and tutorials available

Code Comparison

PySpider example:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

Ferret example:

LET doc = DOCUMENT("https://example.com/")
LET links = ELEMENTS(doc, "a")

FOR link IN links
    FILTER STARTS_WITH(link.href, "http")
    LET page = DOCUMENT(link.href)
    // Process the page

Both examples demonstrate basic crawling functionality, but Ferret uses a more declarative approach with FQL, while PySpider relies on Python classes and methods for defining the crawling behavior.

fake-useragent: Up-to-date simple useragent faker with real world database

Pros of fake-useragent

  • Lightweight and focused on a single task: generating random user agents
  • Easy to integrate into existing projects without additional dependencies
  • Regularly updated with new user agent strings

Cons of fake-useragent

  • Limited functionality compared to pyspider's full-featured web crawling capabilities
  • Lacks built-in scraping and data processing features
  • May require additional libraries for more complex web scraping tasks

Code Comparison

fake-useragent:

from fake_useragent import UserAgent
ua = UserAgent()
random_user_agent = ua.random

pyspider:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    def index_page(self, response):
        return {
            "title": response.doc('title').text(),
        }

Summary

fake-useragent is a specialized library for generating random user agents, while pyspider is a comprehensive web crawling framework. fake-useragent is easier to integrate for simple user agent rotation, but pyspider offers more advanced features for complex web scraping tasks. The choice between the two depends on the specific requirements of your project.
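
The two are not mutually exclusive: since self.crawl accepts per-request headers, a pyspider handler can use fake-useragent to rotate the User-Agent on each request. A minimal sketch (the URL and result fields are illustrative):

from fake_useragent import UserAgent
from pyspider.libs.base_handler import *

ua = UserAgent()

class Handler(BaseHandler):
    def on_start(self):
        # Pick a fresh random User-Agent for this request.
        self.crawl('http://example.com/', callback=self.index_page,
                   headers={'User-Agent': ua.random})

    def index_page(self, response):
        return {'title': response.doc('title').text()}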

README

pyspider

A Powerful Spider(Web Crawler) System in Python.

  • Write script in Python
  • Powerful WebUI with script editor, task monitor, project manager and result viewer
  • MySQL, MongoDB, Redis, SQLite, Elasticsearch; PostgreSQL with SQLAlchemy as database backend
  • RabbitMQ, Redis and Kombu as message queue
  • Task priority, retry, periodical, recrawl by age, etc...
  • Distributed architecture, Crawl Javascript pages, Python 2.{6,7}, 3.{3,4,5,6} support, etc...

Tutorial: http://docs.pyspider.org/en/latest/tutorial/
Documentation: http://docs.pyspider.org/
Release notes: https://github.com/binux/pyspider/releases

Sample Code

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
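
The feature list above mentions task priority, retries, and recrawl by age; these are exposed as keyword arguments to self.crawl. A minimal sketch extending the sample (the values are illustrative, not recommendations):

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/',
                   callback=self.index_page,
                   age=10 * 24 * 60 * 60,  # treat pages fetched within 10 days as fresh
                   priority=2,             # schedule ahead of lower-priority tasks
                   retries=3,              # retry a failed fetch up to 3 times
                   auto_recrawl=True)      # re-crawl automatically every `age` seconds

    def index_page(self, response):
        return {'title': response.doc('title').text()}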

Installation

WARNING: The WebUI is open to the public by default and can be used to execute arbitrary commands, which may harm your system. Please run it only on an internal network or enable need-auth for the WebUI.

Quickstart: http://docs.pyspider.org/en/latest/Quickstart/

Contribute

TODO

v0.4.0

  • a visual scraping interface like portia

License

Licensed under the Apache License, Version 2.0