xianhu / PSpider

A simple and easy-to-use Python web crawler framework. QQ discussion group: 597510560

Top Related Projects

  • Scrapy (55,024 stars): a fast high-level web crawling & scraping framework for Python
  • pyspider (16,692 stars): a powerful spider (web crawler) system in Python
  • aiohttp (15,622 stars): asynchronous HTTP client/server framework for asyncio and Python
  • MechanicalSoup: a Python library for automating interaction with websites
  • buku (6,692 stars): a personal mini-web in text (bookmark manager)

Quick Overview

PSpider is a simple and lightweight Python spider (web crawler) framework. It provides a set of tools and utilities for building web crawlers, with support for multi-threaded crawling and crawling through proxies. The framework is designed to be easy to use and extend, making it suitable for both beginners and experienced developers.

Pros

  • Simple and lightweight architecture, easy to understand and use
  • Supports multi-threading for improved performance
  • Extensible design allows for customization and addition of new features
  • Includes built-in utilities for common crawling tasks (e.g., URL parsing, request handling)

Cons

  • Limited documentation, especially for advanced features
  • Not as feature-rich as some other popular web crawling frameworks
  • May require additional libraries for certain functionalities (e.g., Selenium for JavaScript rendering; see the sketch after this list)
  • Less active development compared to more popular alternatives
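
As noted in the cons above, JavaScript-heavy pages need an external renderer. The snippet below is a minimal, hypothetical sketch that uses Selenium with headless Chrome to obtain the rendered HTML; the (status, content) return value mirrors the fetcher examples later in this overview and is an assumption, not part of PSpider's documented API.

from selenium import webdriver  # requires the selenium package and a matching Chrome driver

def fetch_with_selenium(url):
    # Hypothetical helper: load the page in headless Chrome and return the rendered HTML
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        html = driver.page_source  # HTML after JavaScript has executed
    finally:
        driver.quit()
    return 200, html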

Code Examples

  1. Basic spider setup:
from spider.spider import Spider

def parse_func(url, keys, deep, critical, parse_repeat, content):
    # Return two lists: new URLs to fetch and items to save
    return [], []

if __name__ == "__main__":
    spider = Spider(parse_func, start_url_list=["http://example.com"])
    spider.start_work_and_wait_done()

  2. Custom parser function:
def custom_parser(url, keys, deep, critical, parse_repeat, content):
    # Extract links and data from content
    new_urls = ["http://example.com/page1", "http://example.com/page2"]
    new_data = [{"title": "Example Title", "content": "Example Content"}]
    return new_urls, new_data

  3. Using a custom fetcher (fuller sketches follow this list):
from spider.utilities import UrlFetcher

class CustomFetcher(UrlFetcher):
    def url_fetch(self, url, keys, critical, fetch_repeat):
        # Implement custom fetching logic
        return 200, "Custom content"

spider = Spider(parse_func, fetcher=CustomFetcher())
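
The stubs above return placeholder values. The two sketches below flesh them out with the Python standard library and the requests package; the function signatures simply mirror the examples above and are assumptions, not PSpider's documented API.

A parser that pulls absolute links and the page title out of the fetched HTML using the re module:

import re

def link_extracting_parser(url, keys, deep, critical, parse_repeat, content):
    # Hypothetical parser: collect absolute links and the page title from the HTML
    new_urls = re.findall(r'href="(https?://[^"]+)"', content)
    title_match = re.search(r"<title>(.*?)</title>", content, re.S)
    new_data = [{"url": url, "title": title_match.group(1).strip() if title_match else ""}]
    return new_urls, new_data

A fetcher whose url_fetch method performs a real HTTP request with requests and returns a (status, content) pair like the CustomFetcher stub:

import requests
from spider.utilities import UrlFetcher

class RequestsFetcher(UrlFetcher):
    def url_fetch(self, url, keys, critical, fetch_repeat):
        # Hypothetical fetcher: download the page and return its status code and text
        response = requests.get(url, timeout=10)
        return response.status_code, response.text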

Getting Started

  1. Install PSpider (the README recommends copying the "spider" directory into your project; alternatively, install it from the repository with setup.py):

    git clone https://github.com/xianhu/PSpider.git
    python3 setup.py install

  2. Create a basic spider:

    from spider.spider import Spider
    
    def parse_func(url, keys, deep, critical, parse_repeat, content):
        # Implement parsing logic
        return [], []
    
    spider = Spider(parse_func, start_url_list=["http://example.com"])
    spider.start_work_and_wait_done()
    
  3. Run the spider:

    python your_spider_script.py
    

Competitor Comparisons

Scrapy (55,024 stars)

Scrapy, a fast high-level web crawling & scraping framework for Python.

Pros of Scrapy

  • More mature and widely adopted project with extensive documentation
  • Large ecosystem of extensions and middleware
  • Built-in support for concurrent and distributed crawling

Cons of Scrapy

  • Steeper learning curve for beginners
  • More complex setup and configuration
  • Potentially overkill for simple scraping tasks

Code Comparison

PSpider example:

# fetcher, parser and saver are instances of the framework's Fetcher, Parser and Saver classes
spider = spider.WebSpider(fetcher, parser, saver, url_filter=None)
spider.set_start_url("http://example.com")
spider.start_work_and_wait_done(fetcher_num=10)

Scrapy example:

import scrapy

class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Parsing logic here, e.g. yield scraped items
        yield {"url": response.url}

PSpider offers a more straightforward API for basic scraping tasks, while Scrapy provides a more structured and extensible framework. PSpider's approach may be easier for beginners, but Scrapy's design allows for better scalability and customization in larger projects.

pyspider (16,692 stars)

A Powerful Spider(Web Crawler) System in Python.

Pros of pyspider

  • More comprehensive and feature-rich web crawling framework
  • Includes a web-based user interface for managing and monitoring crawl jobs
  • Supports distributed crawling and multiple backends (Redis, MongoDB, etc.)

Cons of pyspider

  • Steeper learning curve due to its complexity
  • Requires more setup and configuration for advanced features
  • Less actively maintained (last update in 2019)

Code Comparison

PSpider:

spider = PSpider(fetcher, parser, saver, url_filter=None, monitor_sleep_time=5)
spider.set_start_url(url)
spider.start_work_and_wait_done(fetcher_num=10, is_over=True)

pyspider:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {"url": response.url, "title": response.doc('title').text()}

PSpider offers a simpler API for basic crawling tasks, while pyspider provides a more structured approach with decorators and a class-based system for complex crawling scenarios.

aiohttp (15,622 stars)

Asynchronous HTTP client/server framework for asyncio and Python

Pros of aiohttp

  • More comprehensive and widely-used library for asynchronous HTTP
  • Supports both client and server-side operations
  • Extensive documentation and community support

Cons of aiohttp

  • Steeper learning curve, especially for those new to asynchronous programming
  • Not specifically designed for web scraping, requiring additional setup for crawler functionality

Code Comparison

PSpider example:

spider = spider.WebSpider(fetcher, parser, saver, url_filter=None)
spider.set_start_url(url="http://www.example.com")
spider.start_work_and_wait_done(fetcher_num=10, is_over=True)

aiohttp example:

import asyncio, aiohttp

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://www.example.com') as response:
            html = await response.text()
            # Process the HTML content

asyncio.run(main())

Summary

While PSpider is a dedicated web crawling framework, aiohttp is a more general-purpose asynchronous HTTP library. PSpider offers a simpler API for web scraping tasks, but aiohttp provides greater flexibility and can be used for a wider range of applications. The choice between the two depends on the specific requirements of your project and your familiarity with asynchronous programming concepts.
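
As a rough illustration of why aiohttp is a common building block for crawlers, the sketch below (using only public asyncio and aiohttp APIs; the URLs are placeholders) fetches several pages concurrently with asyncio.gather:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # Schedule all requests at once and wait for them to finish
        pages = await asyncio.gather(*(fetch(session, url) for url in urls))
        for url, html in zip(urls, pages):
            print(url, len(html))

asyncio.run(main(["http://example.com", "http://example.org"]))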

MechanicalSoup

A Python library for automating interaction with websites.

Pros of MechanicalSoup

  • Built on top of Requests and BeautifulSoup, providing powerful HTML parsing capabilities
  • Simulates browser behavior such as following links, filling in forms and handling cookies
  • Well-documented and actively maintained

Cons of MechanicalSoup

  • Does not execute JavaScript, so dynamically rendered pages need another tool
  • May be slower for large-scale scraping compared to PSpider's multi-threaded fetchers
  • Less customizable for complex crawling scenarios

Code Comparison

MechanicalSoup:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("http://example.com")
browser.select_form('form[action="/submit"]')
browser["username"] = "user"
browser["password"] = "password"
response = browser.submit_selected()

PSpider:

# fetcher, parser and saver are instances of the framework's Fetcher, Parser and Saver classes
spider = spider.WebSpider(fetcher, parser, saver, url_filter=None)
spider.set_start_url("http://example.com")
spider.start_work_and_wait_done(fetcher_num=10)

MechanicalSoup is more suitable for simulating user interactions such as filling in and submitting forms, while PSpider offers a more flexible approach for crawling many pages with multiple fetcher threads. The choice between the two depends on the specific requirements of your scraping task.

buku (6,692 stars)

Personal mini-web in text (a bookmark manager).

Pros of buku

  • More actively maintained with frequent updates
  • Broader functionality as a bookmark manager, not just a web crawler
  • Cross-platform support (Linux, macOS, Windows)

Cons of buku

  • Focused on bookmark management rather than web crawling
  • Steeper learning curve due to more complex features
  • Requires more system resources for full functionality

Code comparison

buku:

def add_rec(self, url, title_in, tags_in, desc, immutable, delay):
    # Add a new bookmark record
    # Implementation details...

PSpider:

def working(self, url, keys, deep, critical, parse_repeat, *args, **kwargs):
    # Working function for spider
    # Implementation details...

Summary

buku is a feature-rich bookmark manager with cross-platform support, while PSpider is a lightweight web crawling framework. buku offers comprehensive functionality for managing and organizing bookmarks but is not a crawling tool, whereas PSpider focuses on web crawling and has no bookmark management features. The choice between the two depends on whether the project is primarily about bookmark management or web crawling.

README

PSpider

A simple web spider framework written in Python; requires Python 3.8+

Features of PSpider

  1. Supports a multi-threaded crawling mode (using threading)
  2. Supports crawling through proxies (using threading and queue; a small sketch follows this list)
  3. Defines some utility functions and classes, for example UrlFilter, get_string_num, etc
  4. Fewer lines of code, making it easier to read, understand and extend
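
Feature 2 above means a fetcher can take a proxy from a shared queue before making a request and put it back afterwards. A minimal, hypothetical illustration with the requests library follows; the queue handling and proxy address are assumptions, not PSpider's exact interface.

import queue
import requests

proxies_queue = queue.Queue()
proxies_queue.put("http://127.0.0.1:8080")  # hypothetical proxy address

def fetch_through_proxy(url):
    proxy = proxies_queue.get()
    try:
        # requests accepts a mapping of URL scheme to proxy URL
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        return response.status_code, response.text
    finally:
        proxies_queue.put(proxy)  # hand the proxy back for other fetchers to reuse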

Modules of PSpider

  1. utilities module: defines utility functions and classes for the multi-threaded spider (an illustrative sketch follows this list)
  2. instances module: defines the Fetcher, Parser and Saver classes for the multi-threaded spider
  3. concurrent module: defines the WebSpiderFrame of the multi-threaded spider
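
To illustrate the kind of helper the utilities module provides, here is a deliberately simplified, thread-safe URL de-duplication filter; it only sketches the idea and is not PSpider's actual UrlFilter implementation.

import threading

class SimpleUrlFilter(object):
    # Illustrative thread-safe de-duplication filter, not PSpider's real UrlFilter
    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def check_and_add(self, url):
        # Return True if the URL has not been seen before, and remember it
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True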

Procedure of PSpider

①: Fetchers get a TaskFetch from QueueFetch, and make requests based on this task
②: Put the result (TaskParse) of ① into QueueParse, so the Parser can get tasks from it
③: The Parser gets a task from QueueParse, and parses the content to get new TaskFetchs and a TaskSave
④: Put the new TaskFetchs into QueueFetch, so the Fetchers can get tasks from it again
⑤: Put the TaskSave into QueueSave, so the Saver can get tasks from it
⑥: The Saver gets a TaskSave from QueueSave, and saves items to the filesystem or a database
⑦: The Proxieser gets proxies from the web or a database, and puts them into QueueProxies
⑧: Fetchers get proxies from QueueProxies if needed, and make requests through these proxies
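
To make the data flow above concrete, the following is a heavily simplified, hypothetical sketch of the fetch, parse and save pipeline using only the standard threading, queue and urllib modules; PSpider's real queues, task objects and thread pools are richer than this, so treat it purely as an outline of steps ① to ⑥.

import queue
import threading
import urllib.request

queue_fetch = queue.Queue()   # URLs waiting to be fetched (TaskFetch)
queue_parse = queue.Queue()   # fetched pages waiting to be parsed (TaskParse)
queue_save = queue.Queue()    # parsed items waiting to be saved (TaskSave)

def fetcher():
    while True:
        url = queue_fetch.get()
        content = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        queue_parse.put((url, content))
        queue_fetch.task_done()

def parser():
    while True:
        url, content = queue_parse.get()
        # A real parser would also extract new URLs here and put them back into queue_fetch
        queue_save.put({"url": url, "length": len(content)})
        queue_parse.task_done()

def saver():
    while True:
        item = queue_save.get()
        print(item)  # a real Saver would write items to a file or database
        queue_save.task_done()

for worker in (fetcher, parser, saver):
    threading.Thread(target=worker, daemon=True).start()

queue_fetch.put("http://example.com")
queue_fetch.join(); queue_parse.join(); queue_save.join()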

Tutorials of PSpider

Installation (the first method is recommended):
(1) Copy the "spider" directory into your project directory, and import spider
(2) Install spider into your Python environment with python3 setup.py install

See test.py

TodoList

  1. More demos
  2. Distributed crawling
  3. JavaScript execution

If you have any questions or suggestions, you can open an issue or submit a pull request.