Top Related Projects
- Scrapy: a fast, high-level web crawling and scraping framework for Python.
- pyspider: a powerful spider (web crawler) system in Python.
- aiohttp: an asynchronous HTTP client/server framework for asyncio and Python.
- MechanicalSoup: a Python library for automating interaction with websites.
- buku: :bookmark: personal mini-web in text.
Quick Overview
PSpider is a simple and lightweight Python spider (web crawler) framework. It provides a set of tools and utilities for building web crawlers, with support for multi-threading and distributed crawling. The framework is designed to be easy to use and extend, making it suitable for both beginners and experienced developers.
Pros
- Simple and lightweight architecture, easy to understand and use
- Supports multi-threading for improved performance
- Extensible design allows for customization and addition of new features
- Includes built-in utilities for common crawling tasks (e.g., URL parsing, request handling)
Cons
- Limited documentation, especially for advanced features
- Not as feature-rich as some other popular web crawling frameworks
- May require additional libraries for certain functionality (e.g., Selenium for JavaScript rendering; see the sketch after this list)
- Less active development compared to more popular alternatives
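As a concrete illustration of the Selenium point above, JavaScript-heavy pages can be handed to a headless browser from inside a custom fetcher. This is only a sketch: the url_fetch signature mirrors the CustomFetcher example below, and the Selenium calls are ordinary Selenium usage, not something PSpider ships with.

from selenium import webdriver
from spider.utilities import UrlFetcher

class JsFetcher(UrlFetcher):
    # Sketch only: render pages in headless Chrome so JavaScript runs before parsing
    driver = None

    def url_fetch(self, url, keys, critical, fetch_repeat):
        if JsFetcher.driver is None:
            options = webdriver.ChromeOptions()
            options.add_argument("--headless")
            JsFetcher.driver = webdriver.Chrome(options=options)
        JsFetcher.driver.get(url)                     # the browser executes the page's JavaScript
        return 200, JsFetcher.driver.page_source     # status code and rendered HTML, as in CustomFetcher below

Such a fetcher would then be passed to the Spider constructor exactly as CustomFetcher is in the code examples below.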
Code Examples
- Basic spider setup:
from spider.spider import Spider

def parse_func(url, keys, deep, critical, parse_repeat, content):
    # Return two lists: new URLs to crawl and items to save
    return [], []

if __name__ == "__main__":
    spider = Spider(parse_func, start_url_list=["http://example.com"])
    spider.start_work_and_wait_done()
- Custom parser function:
def custom_parser(url, keys, deep, critical, parse_repeat, content):
    # Extract links and data from content
    new_urls = ["http://example.com/page1", "http://example.com/page2"]
    new_data = [{"title": "Example Title", "content": "Example Content"}]
    return new_urls, new_data
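The example above returns hard-coded values; in practice the parser extracts links from the fetched content. A minimal sketch using only the standard library (the regex and the returned item structure are illustrative, not part of PSpider):

import re

def link_parser(url, keys, deep, critical, parse_repeat, content):
    # Pull absolute href targets out of the raw HTML with a simple, illustrative regex
    new_urls = re.findall(r'href="(http[^"]+)"', content)
    # Save one item describing the page that was just parsed
    new_data = [{"url": url, "length": len(content)}]
    return new_urls, new_data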
- Using a custom fetcher:
from spider.utilities import UrlFetcher

class CustomFetcher(UrlFetcher):
    def url_fetch(self, url, keys, critical, fetch_repeat):
        # Implement custom fetching logic
        return 200, "Custom content"

spider = Spider(parse_func, fetcher=CustomFetcher())
Getting Started
- Install PSpider:
pip install pspider
- Create a basic spider:
from spider.spider import Spider

def parse_func(url, keys, deep, critical, parse_repeat, content):
    # Implement parsing logic
    return [], []

spider = Spider(parse_func, start_url_list=["http://example.com"])
spider.start_work_and_wait_done()
- Run the spider:
python your_spider_script.py
Competitor Comparisons
Scrapy, a fast high-level web crawling & scraping framework for Python.
Pros of Scrapy
- More mature and widely adopted project with extensive documentation
- Large ecosystem of extensions and middleware
- Built-in support for concurrent and distributed crawling
Cons of Scrapy
- Steeper learning curve for beginners
- More complex setup and configuration
- Potentially overkill for simple scraping tasks
Code Comparison
PSpider example:
spider = spider.WebSpider(fetcher, parser, saver, url_filter=None)
spider.set_start_url("http://example.com")
spider.start_work_and_wait_done(fetcher_num=10)
Scrapy example:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Parsing logic here
        pass
PSpider offers a more straightforward API for basic scraping tasks, while Scrapy provides a more structured and extensible framework. PSpider's approach may be easier for beginners, but Scrapy's design allows for better scalability and customization in larger projects.
A Powerful Spider(Web Crawler) System in Python.
Pros of pyspider
- More comprehensive and feature-rich web crawling framework
- Includes a web-based user interface for managing and monitoring crawl jobs
- Supports distributed crawling and multiple backends (Redis, MongoDB, etc.)
Cons of pyspider
- Steeper learning curve due to its complexity
- Requires more setup and configuration for advanced features
- Less actively maintained (last update in 2019)
Code Comparison
PSpider:
spider = spider.WebSpider(fetcher, parser, saver, url_filter=None, monitor_sleep_time=5)
spider.set_start_url(url)
spider.start_work_and_wait_done(fetcher_num=10, is_over=True)
pyspider:
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)
PSpider offers a simpler API for basic crawling tasks, while pyspider provides a more structured approach with decorators and a class-based system for complex crawling scenarios.
Asynchronous HTTP client/server framework for asyncio and Python
Pros of aiohttp
- More comprehensive and widely-used library for asynchronous HTTP
- Supports both client and server-side operations
- Extensive documentation and community support
Cons of aiohttp
- Steeper learning curve, especially for those new to asynchronous programming
- Not specifically designed for web scraping; building a crawler on it requires additional setup (see the sketch after the code comparison below)
Code Comparison
PSpider example:
spider = spider.WebSpider(fetcher, parser, saver, url_filter=None)
spider.set_start_url(url="http://www.example.com")
spider.start_work_and_wait_done(fetcher_num=10, is_over=True)
aiohttp example:
import asyncio
import aiohttp

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://www.example.com') as response:
            html = await response.text()
            # Process the HTML content

asyncio.run(main())
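To make the "additional setup" point concrete, a crawler built directly on aiohttp has to manage its own concurrency and URL handling. A minimal sketch (the URL list and what is done with each page are illustrative):

import asyncio
import aiohttp

async def fetch(session, url):
    # Fetch a single page and return it together with its URL
    async with session.get(url) as response:
        return url, await response.text()

async def crawl(urls):
    # Fetch all URLs concurrently over one shared session
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u) for u in urls))
    for url, html in pages:
        print(url, len(html))   # replace with real parsing and saving

asyncio.run(crawl(["http://www.example.com", "http://www.example.org"]))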
Summary
While PSpider is a dedicated web crawling framework, aiohttp is a more general-purpose asynchronous HTTP library. PSpider offers a simpler API for web scraping tasks, but aiohttp provides greater flexibility and can be used for a wider range of applications. The choice between the two depends on the specific requirements of your project and your familiarity with asynchronous programming concepts.
A Python library for automating interaction with websites.
Pros of MechanicalSoup
- Built on top of BeautifulSoup, providing powerful HTML parsing capabilities
- Simulates browser behavior (following links, filling and submitting forms) on top of requests and BeautifulSoup
- Well-documented and actively maintained
Cons of MechanicalSoup
- Synchronous, single-session browsing model, with no built-in support for concurrent crawling
- May be slower for large-scale scraping compared to PSpider's multi-threaded architecture
- Less customizable for complex scraping scenarios
Code Comparison
MechanicalSoup:
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("http://example.com")
browser.select_form('form[action="/submit"]')
browser["username"] = "user"
browser["password"] = "password"
response = browser.submit_selected()
PSpider:
spider = spider.WebSpider(fetcher, parser, saver, url_filter=None)
spider.set_start_url("http://example.com")
spider.start_work_and_wait_done(fetcher_num=10)
MechanicalSoup is more suitable for simulating user interactions and handling JavaScript-rendered content, while PSpider offers a more flexible and scalable approach for large-scale web scraping projects. The choice between the two depends on the specific requirements of your scraping task.
:bookmark: Personal mini-web in text
Pros of buku
- More actively maintained with frequent updates
- Broader functionality as a bookmark manager, not just a web crawler
- Cross-platform support (Linux, macOS, Windows)
Cons of buku
- Focused on bookmark management rather than web crawling
- Steeper learning curve due to more complex features
- Requires more system resources for full functionality
Code comparison
buku:
def add_rec(self, url, title_in, tags_in, desc, immutable, delay):
    # Add a new bookmark record
    # Implementation details...
    ...
PSpider:
def working(self, url, keys, deep, critical, parse_repeat, *args, **kwargs):
    # Working function for the spider
    # Implementation details...
    ...
Summary
buku is a feature-rich bookmark manager with cross-platform support, while PSpider is a lightweight web crawling framework. buku offers more comprehensive functionality for managing and organizing bookmarks, but may be overkill for simple web scraping tasks. PSpider is more focused on web crawling but lacks the extensive bookmark management features of buku. The choice between the two depends on the specific needs of the project, whether it's primarily for bookmark management or web crawling.
README
PSpider
A simple web spider framework written in Python; requires Python 3.8+.
Features of PSpider
- Supports a multi-threaded crawling mode (using threading; see the sketch after this list)
- Supports using proxies while crawling (using threading and queue)
- Defines utility functions and classes, for example UrlFilter, get_string_num, etc.
- Few lines of code; easy to read, understand, and extend
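As a rough illustration of the multi-threaded mode mentioned above, the underlying pattern is a shared task queue drained by several worker threads. The sketch below uses only the standard library and is not PSpider's actual implementation:

import queue
import threading
import urllib.request

url_queue = queue.Queue()
for url in ["http://example.com", "http://example.org"]:
    url_queue.put(url)

def worker():
    # Keep taking URLs from the shared queue until it is empty
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        with urllib.request.urlopen(url) as resp:
            print(url, resp.status)      # fetch the page and report the status code
        url_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()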
Modules of PSpider
- utilities module: defines utility functions and classes for the multi-threaded spider
- instances module: defines the Fetcher, Parser, and Saver classes for the multi-threaded spider
- concurrent module: defines the WebSpiderFrame of the multi-threaded spider
Procedure of PSpider
1. Fetchers get a TaskFetch from QueueFetch and make requests based on this task
2. The result of step 1 (a TaskParse) is put into QueueParse, so the Parser can get tasks from it
3. The Parser gets tasks from QueueParse and parses content to get new TaskFetches and a TaskSave
4. The new TaskFetches are put into QueueFetch, so Fetchers can get tasks from it again
5. The TaskSave is put into QueueSave, so the Saver can get tasks from it
6. The Saver gets TaskSaves from QueueSave and saves items to the filesystem or a database
7. The Proxieser gets proxies from the web or a database and puts them into QueueProxies
8. Fetchers get proxies from QueueProxies if needed and make requests using those proxies
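Putting the pieces together, the instances are wired into the spider frame roughly as follows. The WebSpider constructor and start calls mirror the examples earlier on this page; the import paths and default constructor arguments for Fetcher, Parser, and Saver are assumptions based on the module descriptions above:

import spider
from spider.instances import Fetcher, Parser, Saver

# Assumed import path and default constructors; see the module descriptions above
fetcher = Fetcher()   # makes HTTP requests (steps 1 and 8)
parser = Parser()     # turns fetched content into new fetch tasks and save tasks (step 3)
saver = Saver()       # writes items to the filesystem or a database (step 6)

web_spider = spider.WebSpider(fetcher, parser, saver, url_filter=None)
web_spider.set_start_url("http://example.com")
web_spider.start_work_and_wait_done(fetcher_num=10)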
Tutorials of PSpider
Installation (the first method is recommended):
(1) Copy the "spider" directory into your project directory, and import spider
(2) Install spider into your Python environment with python3 setup.py install
See test.py
TodoList
- More Demos
- Distributed spider
- Execute JavaScript
If you have any questions or suggestions, please open an Issue or submit a Pull Request.