Top Related Projects
- Scrapy: a fast high-level web crawling & scraping framework for Python
- Colly: an elegant scraper and crawler framework for Golang
- Gospider: a fast web spider written in Go
- pyspider: a powerful spider (web crawler) system in Python
- Ferret: declarative web scraping
- Puppeteer: a JavaScript API for Chrome and Firefox
Quick Overview
Grab is a web scraping framework for Python. It provides a simple and intuitive interface for extracting data from websites, handling various network issues, and managing concurrent requests. Grab is designed to be both powerful and easy to use, making it suitable for both small scripts and large scraping projects.
Pros
- Easy to use with a clean and intuitive API
- Supports both synchronous and asynchronous scraping (see the Spider sketch after this list)
- Handles common web scraping challenges like cookies, redirects, and retries
- Extensible through plugins and custom extensions
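To make the asynchronous point concrete, classic Grab releases ship a grab.spider module for concurrent crawling. The following is a minimal sketch, assuming the classic Spider/Task API (task_generator, task_* handlers, thread_number); exact names may differ in your Grab version.

```python
# Minimal sketch of concurrent crawling with grab.spider
# (classic Grab API; verify names against your installed version).
from grab.spider import Spider, Task


class TitleSpider(Spider):
    def task_generator(self):
        # Seed the queue with the initial URLs.
        for url in ['https://example.com', 'https://example.org']:
            yield Task('page', url=url)

    def task_page(self, grab, task):
        # Called for every downloaded 'page' task.
        print(task.url, grab.doc.select('//title').text())


bot = TitleSpider(thread_number=4)  # four concurrent workers
bot.run()
```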
Cons
- Less active development compared to some other scraping libraries
- Documentation could be more comprehensive and up-to-date
- Limited built-in support for JavaScript rendering
- Smaller community compared to more popular alternatives like Scrapy
Code Examples
- Basic usage to fetch a web page:

```python
from grab import Grab

g = Grab()
response = g.go('https://example.com')
print(response.body)
```
- Extracting data with XPath selectors (select() takes XPath expressions, not CSS):

```python
from grab import Grab

g = Grab()
g.go('https://example.com')
title = g.doc.select('//title').text()
links = g.doc.select('//a/@href').text_list()
```
- Handling forms and submitting data:

```python
from grab import Grab

g = Grab()
g.go('https://example.com/login')
g.doc.set_input('username', 'myuser')
g.doc.set_input('password', 'mypass')
g.doc.submit()
```
Getting Started
To get started with Grab, first install it using pip:

```bash
pip install grab
```
Then you can create a simple script that fetches a page and extracts data:

```python
from grab import Grab

g = Grab()
response = g.go('https://example.com')
print(response.body)

# Extract data using XPath selectors
title = g.doc.select('//title').text()
print(f"Page title: {title}")

# List the targets of all links on the page
for link in g.doc.select('//a/@href'):
    print(f"Found link: {link.text()}")
```
This basic example demonstrates how to fetch a page, extract data, and iterate through links. Grab offers many more features for handling complex scraping tasks, which you can explore in the documentation.
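Since the overview highlights handling of network issues, here is a minimal sketch of catching Grab's network errors. It assumes the classic grab.error module (GrabNetworkError, GrabTimeoutError) and the timeout config option; check your installed version before relying on these names.

```python
# Sketch of handling network failures (classic grab.error names; assumption).
from grab import Grab
from grab.error import GrabNetworkError, GrabTimeoutError

g = Grab(timeout=10)  # overall request timeout in seconds (assumed option)
try:
    g.go('https://example.com')
except GrabTimeoutError:
    # GrabTimeoutError subclasses GrabNetworkError, so catch it first.
    print('Request timed out')
except GrabNetworkError as exc:
    print(f'Network error: {exc}')
```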
Competitor Comparisons
Scrapy, a fast high-level web crawling & scraping framework for Python.
Pros of Scrapy
- More extensive documentation and larger community support
- Built-in support for handling concurrent requests
- Robust middleware and pipeline system for customization
Cons of Scrapy
- Steeper learning curve for beginners
- More complex setup and configuration
- Less intuitive for simple scraping tasks
Code Comparison
Scrapy example:

```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'title': response.css('h1::text').get()}
```
Grab example:

```python
from grab import Grab

g = Grab()
g.go('http://example.com')
title = g.doc.select('//h1').text()
print(title)
```
Key Differences
- Scrapy uses a spider-based approach, while Grab uses a more procedural style
- Scrapy has built-in support for generating structured data, whereas Grab requires manual parsing
- Grab offers a simpler API for basic scraping tasks, making it more accessible for beginners
Use Cases
- Scrapy: Large-scale web scraping projects, complex data extraction tasks
- Grab: Quick and simple scraping tasks, projects with limited scope
Both libraries have their strengths, and the choice between them depends on the specific requirements of your project and your level of expertise in web scraping.
Colly: Elegant Scraper and Crawler Framework for Golang
Pros of Colly
- Written in Go, offering better performance and concurrency handling
- More actively maintained with frequent updates and contributions
- Extensive documentation and examples available
Cons of Colly
- Focused on crawling and scraping, while Grab also covers general HTTP work such as file downloads
- Steeper learning curve for developers not familiar with Go
Code Comparison
Colly:

```go
c := colly.NewCollector()

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    fmt.Printf("Link found: %q -> %s\n", e.Text, link)
})

c.Visit("http://example.com/")
```
Grab:

```python
from grab import Grab

g = Grab()
g.go('http://example.com')
for link in g.doc.select('//a/@href'):
    print(link.text())
```
Summary
Colly is a powerful web scraping framework in Go, offering high performance and concurrency. It's well-maintained and documented but focuses solely on web scraping. Grab, written in Python, is more versatile for general file downloading tasks and may be easier for Python developers. The choice between them depends on the specific project requirements, performance needs, and the development team's expertise.
Gospider - Fast web spider written in Go
Pros of gospider
- More focused on web crawling and information gathering
- Supports multiple output formats (JSON, Markdown, CSV)
- Includes features like subdomain enumeration and JavaScript parsing
Cons of gospider
- Less versatile for general-purpose scraping tasks
- May require more setup and configuration for basic use cases
- Limited documentation compared to Grab
Code comparison
gospider:

```go
func main() {
    flag.Parse()
    core.Banner()
    options := core.ParseOptions()
    core.Start(options)
}
```
Grab:

```python
import grab


def main():
    bot = grab.Grab()
    resp = bot.go('http://example.com')
    print(resp.body)
```
Key differences
- gospider is written in Go, while Grab is written in Python
- gospider focuses on web crawling and reconnaissance; Grab is a more general-purpose scraping framework
- gospider offers more built-in features for web security testing and information gathering
- Grab provides a simpler API for basic scraping tasks and integrates well with other Python libraries (see the sketch below)
Both tools have their strengths: gospider is more suitable for security-focused web crawling, while Grab offers a more flexible approach to general web scraping tasks.
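As a hedged illustration of that Python integration, classic Grab exposes the parsed page as an lxml tree via g.doc.tree, so the document can be handed to any lxml-based tooling. The .tree attribute is an assumption based on older Grab releases.

```python
# Sketch: reusing Grab's parsed document with plain lxml
# (g.doc.tree comes from classic Grab releases; assumption).
from grab import Grab

g = Grab()
g.go('http://example.com')
tree = g.doc.tree  # lxml tree of the response body
for anchor in tree.xpath('//a[@href]'):
    print(anchor.get('href'), anchor.text_content().strip())
```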
pyspider: A Powerful Spider (Web Crawler) System in Python
Pros of pyspider
- Built-in web interface for task management and result visualization
- Supports distributed architecture for scalability
- Includes a powerful scheduler for handling complex crawling tasks (see the sketch after this list)
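To make the scheduler point concrete, pyspider's quickstart style drives recurring crawls with the @every and @config decorators. A minimal sketch, assuming the standard BaseHandler API:

```python
# Sketch of pyspider's scheduler decorators (quickstart style).
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    @every(minutes=24 * 60)  # re-run on_start once a day
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)  # re-crawl a page only after 10 days
    def index_page(self, response):
        return {"title": response.doc('title').text()}
```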
Cons of pyspider
- Less frequently updated compared to Grab
- Steeper learning curve due to its more complex architecture
- Limited documentation and community support
Code Comparison
pyspider:

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    def index_page(self, response):
        return {
            "title": response.doc('title').text(),
        }
```
Grab:

```python
from grab import Grab

g = Grab()
g.go('http://example.com/')
title = g.doc.select('//title').text()
print(title)
```
Summary
pyspider offers a more comprehensive solution with its web interface and distributed architecture, making it suitable for large-scale projects. However, Grab is simpler to use and more frequently maintained, making it a better choice for smaller projects or those requiring quick implementation. The code comparison shows that pyspider uses a class-based approach with callbacks, while Grab employs a more straightforward, procedural style.
Ferret: Declarative Web Scraping
Pros of Ferret
- Built with Go, offering better performance and concurrency support
- Supports declarative web scraping with FQL (Ferret Query Language)
- Provides a more comprehensive set of features for complex web scraping tasks
Cons of Ferret
- Steeper learning curve due to FQL and more advanced features
- Less straightforward for simple scraping tasks compared to Grab
- Smaller community and fewer resources available for beginners
Code Comparison
Ferret (FQL):

```
LET doc = DOCUMENT("https://example.com")

FOR el IN ELEMENTS(doc, "div.product")
    RETURN {
        name: INNER_TEXT(el, "h2"),
        price: INNER_TEXT(el, "span.price")
    }
```
Grab (Python):

```python
from grab import Grab

g = Grab()
g.go("https://example.com")
products = g.doc.select("//div[@class='product']")
for product in products:
    name = product.select("h2").text()
    price = product.select("span[@class='price']").text()
    print(name, price)
```
Both libraries offer powerful web scraping capabilities, but Ferret provides a more declarative approach with FQL, while Grab offers a simpler, more Pythonic interface. Ferret may be better suited for complex, high-performance scraping tasks, while Grab excels in ease of use and quick setup for simpler projects.
Puppeteer: JavaScript API for Chrome and Firefox
Pros of Puppeteer
- More comprehensive browser automation capabilities, including full Chrome/Chromium control
- Stronger community support and more frequent updates
- Better documentation and extensive API
Cons of Puppeteer
- Heavier resource usage due to full browser control
- Steeper learning curve for simple scraping tasks
- Limited to JavaScript/Node.js, while Grab fits directly into Python codebases
Code Comparison
Puppeteer example:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const title = await page.title();
  await browser.close();
})();
```
Grab example:

```python
from grab import Grab

g = Grab()
g.go('https://example.com')
title = g.doc.select('//title').text()
```
Puppeteer offers more control over browser interactions, while Grab provides a simpler interface for basic scraping tasks. Puppeteer is better suited for complex web automation scenarios, whereas Grab excels in quick and straightforward data extraction. The choice between the two depends on the specific requirements of your project and your preferred programming language.
README
Grab Framework Project
Status of Project
I myself have not used Grab for many years, and I am not sure it is being used by anybody at present. Nonetheless, I decided to refactor the project, just for fun. I have annotated the whole code base with mypy type hints (in strict mode), and the whole code base complies with pylint and flake8 requirements. There are a few exceptions: very large methods and classes with too many local attributes and variables. I will refactor them eventually.
The current and only network backend is urllib3.
I have refactored a few components into external packages: proxylist, procstat, selection, unicodec, user_agent
Feel free to give feedback in Telegram groups: @grablab and @grablab_ru
Things to be done next
- Refactor source code to remove all pylint disable comments like:
- too-many-instance-attributes
- too-many-arguments
- too-many-locals
- too-many-public-methods
- Reach 100% test coverage (currently about 95%)
- Release a new version to PyPI
- Refactor more components into external packages
- More abstract interfaces
- More data structures and types
- Decouple connections between internal components
Installation
The following installs the old Grab released in 2018:

```bash
pip install -U grab
```

The updated Grab available in this GitHub repository is completely incompatible with spiders and crawlers written for the 2018 release.
Documentation
The updated documentation is at https://grab.readthedocs.io/en/latest/. Most of the updates remove content related to features that have been dropped from Grab since 2018.
Documentation for the old Grab version 0.6.41 (released in 2018) is at https://grab.readthedocs.io/en/v0.6.41-doc/.