
spatie/crawler

An easy-to-use, powerful crawler implemented in PHP. It can execute JavaScript.


Top Related Projects

  • Crawler-Detect: a PHP class for detecting bots/crawlers/spiders via the user agent
  • gospider: a fast web spider written in Go
  • colly: an elegant scraper and crawler framework for Golang
  • Scrapy: a fast, high-level web crawling and scraping framework for Python
  • Heritrix3: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler
  • Cabot: a self-hosted, easily-deployable monitoring and alerts service, like a lightweight PagerDuty

Quick Overview

spatie/crawler is a PHP package that provides a simple and efficient way to crawl websites. It can be used to extract data from web pages, follow links, and perform other web scraping tasks.

Pros

  • Flexibility: The package offers a flexible and customizable API, allowing developers to easily configure the crawling process to suit their specific needs.
  • Scalability: The crawler can handle large-scale web crawling tasks, making it suitable for a wide range of applications, from content aggregation to SEO analysis.
  • Asynchronous Processing: The package utilizes asynchronous processing, which can significantly improve the performance and efficiency of the crawling process.
  • Robust Error Handling: The crawler provides robust error handling, ensuring that the crawling process can continue even if some pages fail to load or encounter other issues.

Cons

  • Limited Functionality: While the package is powerful, it may not provide all the features and functionality that some users might require for more complex web scraping tasks.
  • Dependency on External Libraries: The package relies on several external libraries, which can increase the complexity of the setup and potentially introduce additional dependencies.
  • Learning Curve: Developers who are new to web scraping or the spatie/crawler package may need to invest some time in understanding the API and how to effectively use the package.
  • Potential Legal Concerns: Web scraping can raise legal concerns, and developers should be aware of the relevant laws and regulations in their jurisdiction before using the package.

Code Examples

Here are a few code examples demonstrating the usage of the spatie/crawler package:

  1. Basic Crawling:
use Spatie\Crawler\Crawler;

Crawler::create()
    ->setCrawlObserver(new MyCrawlObserver())
    ->setDelayBetweenRequests(500)
    ->setTotalCrawlLimit(100)
    ->setMaximumDepth(3)
    ->startCrawling('https://example.com');

This code sets up a basic crawl starting from the URL https://example.com, using the MyCrawlObserver class to handle crawl events and configuring a delay between requests, a total crawl limit, and a maximum depth.

  2. Filtering URLs:
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

class MyCustomCrawlProfile extends CrawlProfile
{
    public function shouldCrawl(UriInterface $url): bool
    {
        return str_contains((string) $url, 'example.com');
    }
}

Crawler::create()
    ->setCrawlProfile(new MyCustomCrawlProfile())
    ->setCrawlObserver(new MyCrawlObserver())
    ->startCrawling('https://example.com');

This example demonstrates how to use a custom CrawlProfile to filter the URLs that the crawler should follow. In this case, MyCustomCrawlProfile only allows the crawler to follow URLs whose string representation contains 'example.com'.

  3. Handling Redirects:
use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

class MyCrawlObserver extends CrawlObserver
{
    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        if (in_array($response->getStatusCode(), [301, 302])) {
            // This URL answered with a redirect; handle it separately.
            return;
        }

        // Process successfully crawled pages here.
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        // Handle pages that could not be crawled.
    }
}

Crawler::create()
    ->setCrawlObserver(new MyCrawlObserver())
    ->startCrawling('https://example.com');

This example shows one way to deal with redirects: the crawled method of a custom CrawlObserver receives the PSR-7 response, so responses with a 301 or 302 status code can be detected and handled separately from regular pages.

Getting Started

The package is installed via Composer (composer require spatie/crawler); see the Installation and Usage sections of the README below to get up and running.

Competitor Comparisons

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

Pros of Crawler-Detect

  • Crawler-Detect is a lightweight library focused solely on detecting web crawlers, bots, and spiders.
  • It has a comprehensive database of known crawlers, making it effective at identifying a wide range of bots.
  • The library is easy to integrate into web applications and has a simple API.

Cons of Crawler-Detect

  • Crawler-Detect is not as feature-rich as spatie/crawler, which provides more advanced crawling capabilities.
  • The library is primarily focused on detection and does not provide functionality for crawling websites.
  • The database of known crawlers may not be as frequently updated as some other solutions.

Code Comparison

Crawler-Detect:

use Jaybizzle\CrawlerDetect\CrawlerDetect;

$crawlerDetect = new CrawlerDetect();

if ($crawlerDetect->isCrawler()) {
    echo "This is a bot!";
}

spatie/crawler:

Crawler::create()
    ->setCrawlObserver(new MyCrawlObserver())
    ->startCrawling('https://example.com');

Gospider - Fast web spider written in Go

Pros of gospider

  • Written in Go, potentially offering better performance and concurrency
  • Includes built-in features like subdomain enumeration and JavaScript parsing
  • Designed specifically for web security testing and bug bounty hunting

Cons of gospider

  • Less flexible and customizable compared to crawler
  • May be more complex to use for simple crawling tasks
  • Limited documentation and community support

Code Comparison

crawler (PHP):

Crawler::create()
    ->setCrawlObserver(new MyCrawlObserver)
    ->startCrawling('https://example.com');

gospider (Go):

options := &core.Options{
    Concurrent: 5,
    Depth:      2,
    ParseJS:    true,
}
crawler := core.NewCrawler(options)
crawler.Start("https://example.com")

Both repositories provide web crawling functionality, but they cater to different use cases and ecosystems. crawler is a more general-purpose PHP library with a focus on flexibility and ease of use. It's well-suited for PHP developers who need to integrate crawling capabilities into their applications.

gospider, on the other hand, is a specialized tool written in Go, targeting security researchers and bug bounty hunters. It offers built-in features specific to web security testing, making it more suitable for those scenarios. However, it may be less adaptable for general-purpose crawling tasks compared to crawler.

The choice between the two depends on the specific requirements of your project, your preferred programming language, and whether you need the specialized security-focused features of gospider or the more flexible approach of crawler.


Elegant Scraper and Crawler Framework for Golang

Pros of colly

  • Written in Go, offering better performance and concurrency
  • Lightweight and easy to use with a simple API
  • Supports distributed scraping out of the box

Cons of colly

  • Less feature-rich compared to crawler's extensive options
  • Limited built-in support for JavaScript rendering
  • Smaller community and ecosystem than PHP-based crawler

Code Comparison

crawler (PHP):

Crawler::create()
    ->setCrawlObserver(new MyCrawlObserver)
    ->startCrawling('https://example.com');

colly (Go):

c := colly.NewCollector()
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    e.Request.Visit(e.Attr("href"))
})
c.Visit("https://example.com")

Both libraries provide a straightforward way to start crawling a website. crawler offers a more declarative approach with its fluent interface, while colly uses callback functions for handling different aspects of the crawling process. colly's code is more concise but may require more setup for complex scenarios.


Scrapy, a fast high-level web crawling & scraping framework for Python.

Pros of Scrapy

  • More comprehensive and feature-rich, offering advanced capabilities like item pipelines and middleware
  • Supports multiple output formats (JSON, CSV, XML) out of the box
  • Has a larger community and ecosystem, with more extensions and plugins available

Cons of Scrapy

  • Steeper learning curve due to its complexity and Python-specific concepts
  • Heavier and potentially slower for simple scraping tasks
  • Requires more setup and configuration for basic use cases

Code Comparison

Crawler (PHP):

Crawler::create()
    ->setCrawlObserver(new MyCrawlObserver)
    ->startCrawling('https://example.com');

Scrapy (Python):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Parsing logic here
        pass

Crawler is more straightforward for simple tasks, while Scrapy requires more boilerplate but offers greater flexibility. Crawler is ideal for PHP developers seeking a lightweight solution, whereas Scrapy is better suited for complex, large-scale scraping projects in Python.

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Pros of Heritrix3

  • More robust and scalable, designed for large-scale web archiving
  • Offers advanced features like deduplication and adaptive crawling
  • Provides a web-based user interface for easier management

Cons of Heritrix3

  • Steeper learning curve and more complex setup
  • Requires more system resources due to its comprehensive nature
  • Less suitable for small-scale or quick crawling tasks

Code Comparison

Crawler (PHP):

Crawler::create()
    ->setCrawlObserver(new MyCrawlObserver)
    ->startCrawling('https://example.com');

Heritrix3 (Java):

CrawlJob job = new CrawlJob();
job.setName("MyJob");
job.addSeed("https://example.com");
job.launch();

Key Differences

  • Crawler is lightweight and easy to integrate into PHP projects
  • Heritrix3 is more feature-rich but requires Java and additional setup
  • Crawler is better for small to medium-scale crawling tasks
  • Heritrix3 excels in large-scale, archival-quality web crawling

Use Cases

  • Choose Crawler for simple web scraping or site auditing in PHP environments
  • Opt for Heritrix3 for comprehensive web archiving or large-scale data collection projects

Self-hosted, easily-deployable monitoring and alerts service - like a lightweight PagerDuty

Pros of Cabot

  • Comprehensive monitoring solution with web interface and alerting capabilities
  • Supports multiple service checks (HTTP, Jenkins, Graphite metrics)
  • Integrates with various notification channels (email, SMS, Slack)

Cons of Cabot

  • More complex setup and configuration compared to Crawler
  • Focused on monitoring rather than web crawling
  • Less flexibility for custom crawling logic

Code Comparison

Crawler (PHP):

Crawler::create()
    ->setCrawlObserver(new MyCrawlObserver)
    ->startCrawling('https://example.com');

Cabot (Python):

from cabot.cabotapp.models import HttpStatusCheck, Service

service = Service.objects.create(
    name="My Website",
    url="https://example.com",
)
# status_checks is a many-to-many relation, so checks are attached after creation.
service.status_checks.add(
    HttpStatusCheck.objects.create(endpoint="/"),
)

While Crawler is designed for web crawling tasks, Cabot is built for monitoring and alerting. Crawler offers a simpler API for crawling websites, whereas Cabot provides a more comprehensive solution for monitoring various services and metrics. The code examples illustrate the difference in focus: Crawler initializes a crawl job, while Cabot sets up a service to be monitored.


README

🕸 Crawl the web using PHP 🕷


This package provides a class to crawl links on a website. Under the hood Guzzle promises are used to crawl multiple urls concurrently.

Because the crawler can execute JavaScript, it can crawl JavaScript rendered sites. Under the hood Chrome and Puppeteer are used to power this feature.

Support us

We invest a lot of resources into creating best in class open source packages. You can support us by buying one of our paid products.

We highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using. You'll find our address on our contact page. We publish all received postcards on our virtual postcard wall.

Installation

This package can be installed via Composer:

composer require spatie/crawler

Usage

The crawler can be instantiated like this:

use Spatie\Crawler\Crawler;

Crawler::create()
    ->setCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);

The argument passed to setCrawlObserver must be an object that extends the \Spatie\Crawler\CrawlObservers\CrawlObserver abstract class:

namespace Spatie\Crawler\CrawlObservers;

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;

abstract class CrawlObserver
{
    /*
     * Called when the crawler will crawl the url.
     */
    public function willCrawl(UriInterface $url, ?string $linkText): void
    {
    }

    /*
     * Called when the crawler has crawled the given url successfully.
     */
    abstract public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText,
    ): void;

    /*
     * Called when the crawler had a problem crawling the given url.
     */
    abstract public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /**
     * Called when the crawl has ended.
     */
    public function finishedCrawling(): void
    {
    }
}
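
As a concrete illustration, an observer that collects links that could not be crawled might look like the sketch below. The class name and the way results are collected are illustrative, not part of the package:

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

class BrokenLinkObserver extends CrawlObserver
{
    /** @var array<int, string> */
    public array $brokenLinks = [];

    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        // Nothing to do for pages that were crawled successfully.
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        // Remember which url failed and where it was found.
        $this->brokenLinks[] = sprintf('%s (found on %s)', $url, $foundOnUrl ?? 'start url');
    }
}

An instance of this observer can then be passed to setCrawlObserver, or combined with other observers as shown below.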

Using multiple observers

You can set multiple observers with setCrawlObservers:

Crawler::create()
    ->setCrawlObservers([
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        ...
     ])
    ->startCrawling($url);

Alternatively you can set multiple observers one by one with addCrawlObserver:

Crawler::create()
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);

Executing JavaScript

By default, the crawler will not execute JavaScript. This is how you can enable the execution of JavaScript:

Crawler::create()
    ->executeJavaScript()
    ...

To be able to get the body HTML after the JavaScript has been executed, this package depends on our Browsershot package. That package uses Puppeteer under the hood. Here are some pointers on how to install it on your system.

Browsershot will make an educated guess as to where its dependencies are installed on your system. By default, the Crawler will instantiate a new Browsershot instance. If you need to, you can pass a custom instance using the setBrowsershot(Browsershot $browsershot) method.

Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript()
    ...

Note that the crawler will still work even if you don't have the system dependencies required by Browsershot. These system dependencies are only required if you're calling executeJavaScript().
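
If Browsershot's guess doesn't match your setup, a manually configured instance could be passed in like this sketch (the binary paths are assumptions and should be adjusted to your system):

use Spatie\Browsershot\Browsershot;
use Spatie\Crawler\Crawler;

// Assumed locations of the node and npm binaries.
$browsershot = (new Browsershot())
    ->setNodeBinary('/usr/local/bin/node')
    ->setNpmBinary('/usr/local/bin/npm');

Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript()
    ...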

Filtering certain urls

You can tell the crawler not to visit certain urls by using the setCrawlProfile function. That function expects an object that extends Spatie\Crawler\CrawlProfiles\CrawlProfile:

/*
 * Determine if the given url should be crawled.
 */
public function shouldCrawl(UriInterface $url): bool;

This package comes with three CrawlProfiles out of the box:

  • CrawlAllUrls: this profile will crawl all urls on all pages including urls to an external site.
  • CrawlInternalUrls: this profile will only crawl the internal urls on the pages of a host.
  • CrawlSubdomains: this profile will only crawl the internal urls of a host and the urls of its subdomains.
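
For example, limiting a crawl to internal urls with the built-in CrawlInternalUrls profile could look like this sketch (it assumes the profile takes the base url as a constructor argument and that a MyCrawlObserver class exists as described above):

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProfiles\CrawlInternalUrls;

Crawler::create()
    // Only follow links that stay on example.com.
    ->setCrawlProfile(new CrawlInternalUrls('https://example.com'))
    ->setCrawlObserver(new MyCrawlObserver())
    ->startCrawling('https://example.com');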

Custom link extraction

You can customize how links are extracted from a page by passing a custom UrlParser to the crawler.

Crawler::create()
    ->setUrlParserClass(<class that implements \Spatie\Crawler\UrlParsers\UrlParser>::class)
    ...

By default, the LinkUrlParser is used. This parser will extract all links from the href attribute of anchor (a) tags.

There is also a built-in SitemapUrlParser that will extract & crawl all links from a sitemap. It does support sitemap index files.

Crawler::create()
    ->setUrlParserClass(SitemapUrlParser::class)
    ...

Ignoring robots.txt and robots meta

By default, the crawler will respect robots data. It is possible to disable these checks like so:

Crawler::create()
    ->ignoreRobots()
    ...

Robots data can come from either a robots.txt file, meta tags or response headers. More information on the spec can be found here: http://www.robotstxt.org/.

Parsing robots data is done by our package spatie/robots-txt.

Accept links with rel="nofollow" attribute

By default, the crawler will reject all links containing attribute rel="nofollow". It is possible to disable these checks like so:

Crawler::create()
    ->acceptNofollowLinks()
    ...

Using a custom User Agent

In order to respect robots.txt rules for a custom User Agent, you can specify your own User Agent.

Crawler::create()
    ->setUserAgent('my-agent')

You can add your specific crawl rule group for 'my-agent' in robots.txt. This example disallows crawling the entire site for crawlers identified by 'my-agent'.

// Disallow crawling for my-agent
User-agent: my-agent
Disallow: /

Setting the number of concurrent requests

To improve the speed of the crawl, the package concurrently crawls 10 urls by default. If you want to change that number you can use the setConcurrency method.

Crawler::create()
    ->setConcurrency(1) // now all urls will be crawled one by one

Defining Crawl and Time Limits

By default, the crawler continues until it has crawled every page it can find. This behavior might cause issues if you are working in an environment with limitations such as a serverless environment.

The crawl behavior can be controlled with the following options:

  • Total Crawl Limit (setTotalCrawlLimit): This limit defines the maximal count of URLs to crawl.
  • Current Crawl Limit (setCurrentCrawlLimit): This defines how many URLs are processed during the current crawl.
  • Total Execution Time Limit (setTotalExecutionTimeLimit): This limit defines the maximal execution time of the crawl.
  • Current Execution Time Limit (setCurrentExecutionTimeLimit): This limits the execution time of the current crawl.

Let's take a look at some examples to clarify the difference between setTotalCrawlLimit and setCurrentCrawlLimit. The difference between setTotalExecutionTimeLimit and setCurrentExecutionTimeLimit will be the same.

Example 1: Using the total crawl limit

The setTotalCrawlLimit method allows you to limit the total number of URLs to crawl, no matter how often you call the crawler.

$queue = <your selection/implementation of a queue>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);

Example 2: Using the current crawl limit

The setCurrentCrawlLimit method sets a limit on how many URLs will be crawled per execution. This piece of code will process 5 pages with each execution, without a total limit on the number of pages to crawl.

$queue = <your selection/implementation of a queue>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

Example 3: Combining the total and current crawl limit

Both limits can be combined to control the crawler:

$queue = <your selection/implementation of a queue>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

Example 4: Crawling across requests

You can use the setCurrentCrawlLimit to break up long running crawls. The following example demonstrates a (simplified) approach. It's made up of an initial request and any number of follow-up requests continuing the crawl.

Initial Request

To start crawling across different requests, you will need to create a new queue of your selected queue-driver. Start by passing the queue-instance to the crawler. The crawler will start filling the queue as pages are processed and new URLs are discovered. Serialize and store the queue reference after the crawler has finished (using the current crawl limit).

// Create a queue using your queue-driver.
$queue = <your selection/implementation of a queue>;

// Crawl the first set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue
$serializedQueue = serialize($queue);

Subsequent Requests

For any following requests you will need to unserialize your original queue and pass it to the crawler:

// Unserialize queue
$queue = unserialize($serializedQueue);

// Crawls the next set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue
$serializedQueue = serialize($queue);

The behavior is based on the information in the queue. The limits only work as described if the same queue instance is passed in. When a completely new queue is passed in, the limits of previous crawls, even for the same website, won't apply.

An example with more details can be found here.
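
As a minimal sketch of one way to persist the queue between requests (the file path, and using a file rather than a database or cache, are assumptions for illustration):

// At the end of a request: store the serialized queue.
file_put_contents('/tmp/crawl-queue.serialized', serialize($queue));

// At the start of the next request: restore it.
$queue = unserialize(file_get_contents('/tmp/crawl-queue.serialized'));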

Setting the maximum crawl depth

By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the depth of the crawler you can use the setMaximumDepth method.

Crawler::create()
    ->setMaximumDepth(2)

Setting the maximum response size

Most html pages are quite small. But the crawler could accidentally pick up on large files such as PDFs and MP3s. To keep memory usage low in such cases the crawler will only use the responses that are smaller than 2 MB. If, when streaming a response, it becomes larger than 2 MB, the crawler will stop streaming the response. An empty response body will be assumed.

You can change the maximum response size.

// let's use a 3 MB maximum.
Crawler::create()
    ->setMaximumResponseSize(1024 * 1024 * 3)

Add a delay between requests

In some cases you might get rate-limited when crawling too aggressively. To circumvent this, you can use the setDelayBetweenRequests() method to add a pause between every request. This value is expressed in milliseconds.

Crawler::create()
    ->setDelayBetweenRequests(150) // After every page crawled, the crawler will wait for 150ms

Limiting which content-types to parse

By default, every found page will be downloaded (up to setMaximumResponseSize() in size) and parsed for additional links. You can limit which content-types should be downloaded and parsed by calling setParseableMimeTypes() with an array of allowed types.

Crawler::create()
    ->setParseableMimeTypes(['text/html', 'text/plain'])

This will prevent downloading the body of pages that have different mime types, like binary files, audio/video, ... that are unlikely to have links embedded in them. This feature mostly saves bandwidth.

Using a custom crawl queue

When crawling a site the crawler will put urls to be crawled in a queue. By default, this queue is stored in memory using the built-in ArrayCrawlQueue.

When a site is very large you may want to store that queue elsewhere, maybe a database. In such cases, you can write your own crawl queue.

A valid crawl queue is any class that implements the Spatie\Crawler\CrawlQueues\CrawlQueue-interface. You can pass your custom crawl queue via the setCrawlQueue method on the crawler.

Crawler::create()
    ->setCrawlQueue(<implementation of \Spatie\Crawler\CrawlQueues\CrawlQueue>)
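
For example, the built-in in-memory queue can be passed explicitly in the same way (a sketch; any implementation of the CrawlQueue interface could take its place):

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;

Crawler::create()
    // The default in-memory queue, passed explicitly here for illustration.
    ->setCrawlQueue(new ArrayCrawlQueue())
    ->startCrawling('https://example.com');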


Change the default base url scheme

By default, the crawler will set the base url scheme to http if none is set. You can change that with setDefaultScheme.

Crawler::create()
    ->setDefaultScheme('https')

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Testing

First, install the Puppeteer dependency, or your tests will fail.

npm install puppeteer

To run the tests you'll have to start the included node-based server first, in a separate terminal window.

cd tests/server
npm install
node server.js

With the server running, you can start testing.

composer test

Security

If you've found a bug regarding security please mail security@spatie.be instead of using the issue tracker.

Postcardware

You're free to use this package, but if it makes it to your production environment we highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using.

Our address is: Spatie, Kruikstraat 22, 2018 Antwerp, Belgium.

We publish all received postcards on our company website.

Credits

License

The MIT License (MIT). Please see License File for more information.