Top Related Projects
- Scrapy: a fast, high-level web crawling and scraping framework for Python.
- pyspider: a powerful spider (web crawler) system in Python.
- aiohttp: an asynchronous HTTP client/server framework for asyncio and Python.
- MechanicalSoup: a Python library for automating interaction with websites.
- buku: :bookmark: personal mini-web in text.
Quick Overview
PSpider is a simple and lightweight Python spider (web crawler) framework. It provides a set of tools and utilities for building web crawlers, with support for multi-threading and distributed crawling. The framework is designed to be easy to use and extend, making it suitable for both beginners and experienced developers.
Pros
- Simple and lightweight architecture, easy to understand and use
- Supports multi-threading for improved performance
- Extensible design allows for customization and addition of new features
- Includes built-in utilities for common crawling tasks (e.g., URL parsing, request handling)
Cons
- Limited documentation, especially for advanced features
- Not as feature-rich as some other popular web crawling frameworks
- May require additional libraries for certain functionality (e.g., Selenium for JavaScript rendering; see the sketch after this list)
- Less active development compared to more popular alternatives
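As a concrete illustration of the Selenium point above, JavaScript-heavy pages can be handed to a headless browser from inside a custom fetcher. This is only a sketch: the url_fetch signature mirrors the CustomFetcher example below, and the Selenium calls are ordinary Selenium usage, not something PSpider ships with.

from selenium import webdriver
from spider.utilities import UrlFetcher

class JsFetcher(UrlFetcher):
    # Sketch only: render pages in headless Chrome so JavaScript runs before parsing
    driver = None

    def url_fetch(self, url, keys, critical, fetch_repeat):
        if JsFetcher.driver is None:
            options = webdriver.ChromeOptions()
            options.add_argument("--headless")
            JsFetcher.driver = webdriver.Chrome(options=options)
        JsFetcher.driver.get(url)                     # the browser executes the page's JavaScript
        return 200, JsFetcher.driver.page_source     # status code and rendered HTML, as in CustomFetcher below

Such a fetcher would then be passed to the Spider constructor exactly as CustomFetcher is in the code examples below.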
Code Examples
- Basic spider setup:
from spider.spider import Spider

def parse_func(url, keys, deep, critical, parse_repeat, content):
    # Return two lists: new URLs to crawl and items to save
    return [], []

if __name__ == "__main__":
    spider = Spider(parse_func, start_url_list=["http://example.com"])
    spider.start_work_and_wait_done()
- Custom parser function:
def custom_parser(url, keys, deep, critical, parse_repeat, content):
    # Extract links and data from content
    new_urls = ["http://example.com/page1", "http://example.com/page2"]
    new_data = [{"title": "Example Title", "content": "Example Content"}]
    return new_urls, new_data
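The example above returns hard-coded values; in practice the parser extracts links from the fetched content. A minimal sketch using only the standard library (the regex and the returned item structure are illustrative, not part of PSpider):

import re

def link_parser(url, keys, deep, critical, parse_repeat, content):
    # Pull absolute href targets out of the raw HTML with a simple, illustrative regex
    new_urls = re.findall(r'href="(http[^"]+)"', content)
    # Save one item describing the page that was just parsed
    new_data = [{"url": url, "length": len(content)}]
    return new_urls, new_data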
- Using a custom fetcher:
from spider.utilities import UrlFetcher

class CustomFetcher(UrlFetcher):
    def url_fetch(self, url, keys, critical, fetch_repeat):
        # Implement custom fetching logic
        return 200, "Custom content"

spider = Spider(parse_func, fetcher=CustomFetcher())
Getting Started
- Install PSpider:
pip install pspider
- Create a basic spider:
from spider.spider import Spider

def parse_func(url, keys, deep, critical, parse_repeat, content):
    # Implement parsing logic
    return [], []

spider = Spider(parse_func, start_url_list=["http://example.com"])
spider.start_work_and_wait_done()
- Run the spider:
python your_spider_script.py
Competitor Comparisons
Scrapy, a fast high-level web crawling & scraping framework for Python.
Pros of Scrapy
- More mature and widely adopted project with extensive documentation
- Large ecosystem of extensions and middleware
- Built-in support for concurrent and distributed crawling
Cons of Scrapy
- Steeper learning curve for beginners
- More complex setup and configuration
- Potentially overkill for simple scraping tasks
Code Comparison
PSpider example:
spider = spider.WebSpider(fetcher, parser, saver, url_filter=None)
spider.set_start_url("http://example.com")
spider.start_work_and_wait_done(fetcher_num=10)
Scrapy example:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Parsing logic here
        pass
PSpider offers a more straightforward API for basic scraping tasks, while Scrapy provides a more structured and extensible framework. PSpider's approach may be easier for beginners, but Scrapy's design allows for better scalability and customization in larger projects.
A Powerful Spider(Web Crawler) System in Python.
Pros of pyspider
- More comprehensive and feature-rich web crawling framework
- Includes a web-based user interface for managing and monitoring crawl jobs
- Supports distributed crawling and multiple backends (Redis, MongoDB, etc.)
Cons of pyspider
- Steeper learning curve due to its complexity
- Requires more setup and configuration for advanced features
- Less actively maintained (last update in 2019)
Code Comparison
PSpider:
spider = spider.WebSpider(fetcher, parser, saver, url_filter=None, monitor_sleep_time=5)
spider.set_start_url(url)
spider.start_work_and_wait_done(fetcher_num=10, is_over=True)
pyspider:
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)
PSpider offers a simpler API for basic crawling tasks, while pyspider provides a more structured approach with decorators and a class-based system for complex crawling scenarios.
Asynchronous HTTP client/server framework for asyncio and Python
Pros of aiohttp
- More comprehensive and widely-used library for asynchronous HTTP
- Supports both client and server-side operations
- Extensive documentation and community support
Cons of aiohttp
- Steeper learning curve, especially for those new to asynchronous programming
- Not specifically designed for web scraping; building a crawler on it requires additional setup (see the sketch after the code comparison below)
Code Comparison
PSpider example:
spider = spider.WebSpider(fetcher, parser, saver, url_filter=None)
spider.set_start_url(url="http://www.example.com")
spider.start_work_and_wait_done(fetcher_num=10, is_over=True)
aiohttp example:
import asyncio
import aiohttp

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://www.example.com') as response:
            html = await response.text()
            # Process the HTML content

asyncio.run(main())
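To make the "additional setup" point concrete, a crawler built directly on aiohttp has to manage its own concurrency and URL handling. A minimal sketch (the URL list and what is done with each page are illustrative):

import asyncio
import aiohttp

async def fetch(session, url):
    # Fetch a single page and return it together with its URL
    async with session.get(url) as response:
        return url, await response.text()

async def crawl(urls):
    # Fetch all URLs concurrently over one shared session
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u) for u in urls))
    for url, html in pages:
        print(url, len(html))   # replace with real parsing and saving

asyncio.run(crawl(["http://www.example.com", "http://www.example.org"]))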
Summary
While PSpider is a dedicated web crawling framework, aiohttp is a more general-purpose asynchronous HTTP library. PSpider offers a simpler API for web scraping tasks, but aiohttp provides greater flexibility and can be used for a wider range of applications. The choice between the two depends on the specific requirements of your project and your familiarity with asynchronous programming concepts.
A Python library for automating interaction with websites.
Pros of MechanicalSoup
- Built on top of BeautifulSoup, providing powerful HTML parsing capabilities
- Simulates browser behavior (following links, filling and submitting forms) on top of requests and BeautifulSoup
- Well-documented and actively maintained
Cons of MechanicalSoup
- Synchronous, single-session browsing model, with no built-in support for concurrent crawling
- May be slower for large-scale scraping compared to PSpider's multi-threaded architecture
- Less customizable for complex scraping scenarios
Code Comparison
MechanicalSoup:
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("http://example.com")
browser.select_form('form[action="/submit"]')
browser["username"] = "user"
browser["password"] = "password"
response = browser.submit_selected()
PSpider:
spider = spider.WebSpider(fetcher, parser, saver, url_filter=None)
spider.set_start_url("http://example.com")
spider.start_work_and_wait_done(fetcher_num=10)
MechanicalSoup is more suitable for simulating user interactions and handling JavaScript-rendered content, while PSpider offers a more flexible and scalable approach for large-scale web scraping projects. The choice between the two depends on the specific requirements of your scraping task.
:bookmark: Personal mini-web in text
Pros of buku
- More actively maintained with frequent updates
- Broader functionality as a bookmark manager, not just a web crawler
- Cross-platform support (Linux, macOS, Windows)
Cons of buku
- Focused on bookmark management rather than web crawling
- Steeper learning curve due to more complex features
- Requires more system resources for full functionality
Code comparison
buku:
def add_rec(self, url, title_in, tags_in, desc, immutable, delay):
    # Add a new bookmark record
    # Implementation details...
    ...
PSpider:
def working(self, url, keys, deep, critical, parse_repeat, *args, **kwargs):
    # Working function for the spider
    # Implementation details...
    ...
Summary
buku is a feature-rich bookmark manager with cross-platform support, while PSpider is a lightweight web crawling framework. buku offers more comprehensive functionality for managing and organizing bookmarks, but may be overkill for simple web scraping tasks. PSpider is more focused on web crawling but lacks the extensive bookmark management features of buku. The choice between the two depends on the specific needs of the project, whether it's primarily for bookmark management or web crawling.
README
PSpider
A simple web spider framework written in Python; requires Python 3.8+.
Features of PSpider
- Supports a multi-threaded crawling mode (using threading; see the sketch after this list)
- Supports using proxies while crawling (using threading and queue)
- Defines utility functions and classes, for example UrlFilter, get_string_num, etc.
- Few lines of code; easy to read, understand, and extend
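As a rough illustration of the multi-threaded mode mentioned above, the underlying pattern is a shared task queue drained by several worker threads. The sketch below uses only the standard library and is not PSpider's actual implementation:

import queue
import threading
import urllib.request

url_queue = queue.Queue()
for url in ["http://example.com", "http://example.org"]:
    url_queue.put(url)

def worker():
    # Keep taking URLs from the shared queue until it is empty
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        with urllib.request.urlopen(url) as resp:
            print(url, resp.status)      # fetch the page and report the status code
        url_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()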
Modules of PSpider
- utilities module: defines utility functions and classes for the multi-threaded spider
- instances module: defines the Fetcher, Parser, and Saver classes for the multi-threaded spider
- concurrent module: defines the WebSpiderFrame of the multi-threaded spider
Procedure of PSpider
1. Fetchers get a TaskFetch from QueueFetch and make requests based on this task
2. The result of step 1 (a TaskParse) is put into QueueParse, so the Parser can get tasks from it
3. The Parser gets tasks from QueueParse and parses content to get new TaskFetches and a TaskSave
4. The new TaskFetches are put into QueueFetch, so Fetchers can get tasks from it again
5. The TaskSave is put into QueueSave, so the Saver can get tasks from it
6. The Saver gets TaskSaves from QueueSave and saves items to the filesystem or a database
7. The Proxieser gets proxies from the web or a database and puts them into QueueProxies
8. Fetchers get proxies from QueueProxies if needed and make requests using those proxies
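Putting the pieces together, the instances are wired into the spider frame roughly as follows. The WebSpider constructor and start calls mirror the examples earlier on this page; the import paths and default constructor arguments for Fetcher, Parser, and Saver are assumptions based on the module descriptions above:

import spider
from spider.instances import Fetcher, Parser, Saver

# Assumed import path and default constructors; see the module descriptions above
fetcher = Fetcher()   # makes HTTP requests (steps 1 and 8)
parser = Parser()     # turns fetched content into new fetch tasks and save tasks (step 3)
saver = Saver()       # writes items to the filesystem or a database (step 6)

web_spider = spider.WebSpider(fetcher, parser, saver, url_filter=None)
web_spider.set_start_url("http://example.com")
web_spider.start_work_and_wait_done(fetcher_num=10)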
Tutorials of PSpider
Installation (the first method is recommended):
(1) Copy the "spider" directory into your project directory, and import spider
(2) Install spider into your Python environment with python3 setup.py install
See test.py
TodoList
- More Demos
- Distributed spider
- Execute JavaScript
If you have any questions or suggestions, please open an Issue or submit a Pull Request.