
FriendsOfPHP/Goutte

Goutte, a simple PHP Web Scraper


Top Related Projects

  • browser-kit: Simulates the behavior of a web browser, allowing you to make requests, click on links and submit forms programmatically
  • Guzzle (23,156 stars): Guzzle, an extensible PHP HTTP client
  • Crawler (2,559 stars): An easy to use, powerful crawler implemented in PHP. Can execute Javascript.
  • Dusk (1,885 stars): Laravel Dusk provides simple end-to-end testing and browser automation.
  • Twig (8,218 stars): Twig, the flexible, fast, and secure template language for PHP
  • Goutte (9,263 stars): Goutte, a simple PHP Web Scraper

Quick Overview

Goutte is a PHP web scraping library that provides a simple yet powerful interface for extracting data from websites. It allows developers to programmatically navigate web pages, fill out forms, and extract content using CSS selectors or XPath expressions.
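As a rough illustration of the two selector styles, the sketch below runs the same query once as a CSS selector and once as an XPath expression. It uses the Symfony DomCrawler component, whose Crawler objects Goutte returns, on an inline HTML string. The markup and class names are made up for the example, and it assumes symfony/dom-crawler and symfony/css-selector are installed through Composer (Goutte pulls both in).

```php
<?php

require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for a fetched page.
$html = <<<'HTML'
<html><body>
  <ul>
    <li class="item"><a href="/a">First</a></li>
    <li class="item"><a href="/b">Second</a></li>
  </ul>
</body></html>
HTML;

$crawler = new Crawler($html);

// CSS selector: every link that is a direct child of an li with class "item".
$viaCss = $crawler->filter('li.item > a')->each(function (Crawler $node) {
    return $node->text();
});

// XPath expression selecting the same nodes.
$viaXPath = $crawler->filterXPath('//li[@class="item"]/a')->each(function (Crawler $node) {
    return $node->text();
});

// Attributes are read with attr().
$firstHref = $crawler->filter('li.item > a')->first()->attr('href');

print_r($viaCss);
```

CSS selectors are usually shorter for class- and tag-based lookups; XPath is handy when the query depends on document structure that CSS cannot express.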

Pros

  • Easy to use with a clean and intuitive API
  • Few moving parts: a thin wrapper around the Symfony BrowserKit, CssSelector, DomCrawler, and HttpClient components
  • Supports both GET and POST requests, as well as form submission
  • Integrates well with other PHP libraries and frameworks

Cons

  • No JavaScript support: it does not execute JavaScript on pages
  • May not handle complex, dynamic websites as effectively as browser-based scrapers
  • Lacks built-in features for handling CAPTCHAs or IP rotation
  • Deprecated: as of v4 it is a simple proxy to the HttpBrowser class from Symfony BrowserKit, which is the recommended replacement

Code Examples

  1. Basic usage to scrape a webpage:
use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://example.com');

$title = $crawler->filter('h1')->text();
echo $title;
  2. Submitting a form:
$crawler = $client->request('GET', 'https://example.com/form');
$form = $crawler->selectButton('Submit')->form();
$crawler = $client->submit($form, ['username' => 'john', 'password' => 'secret']);
  3. Extracting multiple elements:
$crawler->filter('.product')->each(function ($node) {
    echo $node->filter('.name')->text() . "\n";
    echo $node->filter('.price')->text() . "\n";
});
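When the extracted values need to be kept rather than printed, one option is to return them from the callback: each() collects the callback's return values into an array. The sketch below shows this on an inline HTML string, with a fallback for a missing element. The markup is hypothetical, and passing a default to text() assumes a reasonably recent symfony/dom-crawler (the version Goutte v4 installs should qualify).

```php
<?php

require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical product listing; the second product has no price element.
$html = <<<'HTML'
<html><body>
<div class="product"><span class="name">Lamp</span><span class="price">19.99</span></div>
<div class="product"><span class="name">Desk</span></div>
</body></html>
HTML;

$crawler = new Crawler($html);

// each() returns an array holding whatever the callback returns per node.
$products = $crawler->filter('.product')->each(function (Crawler $node) {
    return [
        'name' => $node->filter('.name')->text(),
        // text() throws on an empty node list unless a default is passed.
        'price' => $node->filter('.price')->text('n/a'),
    ];
});

print_r($products);
```

Without the default argument, the second product would raise an InvalidArgumentException because its .price selection is empty.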

Getting Started

  1. Install Goutte using Composer:
composer require fabpot/goutte
  2. Create a new PHP file and add the following code:
<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://example.com');

// Use $crawler to extract data from the page
$title = $crawler->filter('title')->text();
echo "Page title: " . $title;
  3. Run the PHP file to see the results.

Competitor Comparisons

Simulates the behavior of a web browser, allowing you to make requests, click on links and submit forms programmatically

Pros of browser-kit

  • More lightweight and focused on core functionality
  • Better integration with other Symfony components
  • More frequent updates and active maintenance

Cons of browser-kit

  • Less feature-rich out of the box compared to Goutte
  • Requires more setup and configuration for advanced use cases
  • May need additional dependencies for certain functionalities

Code Comparison

browser-kit:

use Symfony\Component\BrowserKit\HttpBrowser;

$browser = new HttpBrowser();
$crawler = $browser->request('GET', 'https://example.com');
$content = $crawler->filter('.content')->text();

Goutte:

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://example.com');
$content = $crawler->filter('.content')->text();

Both libraries expose the same request-and-crawl API. As of Goutte v4, Goutte\Client is a simple proxy to BrowserKit's HttpBrowser class, so the two snippets above behave the same way. Because Goutte is deprecated, new code should generally use HttpBrowser directly, which also integrates with the rest of the Symfony ecosystem.

Guzzle, an extensible PHP HTTP client (23,156 stars)

Pros of Guzzle

  • More feature-rich and flexible HTTP client library
  • Supports both synchronous and asynchronous requests
  • Extensive middleware system for customizing request/response handling

Cons of Guzzle

  • Steeper learning curve due to more complex API
  • Requires more setup and configuration for basic scraping tasks
  • May be overkill for simple web scraping projects

Code Comparison

Goutte:

$client = new Goutte\Client();
$crawler = $client->request('GET', 'https://example.com');
$title = $crawler->filter('h1')->text();

Guzzle:

$client = new GuzzleHttp\Client();
$response = $client->request('GET', 'https://example.com');
$body = (string) $response->getBody(); // cast the PSR-7 stream to a string
// Additional parsing required to extract data

Goutte offers a simpler API for web scraping: it pairs an HTTP client with a DOM crawler and CSS selector support, so data can be extracted from HTML responses directly. Earlier Goutte releases were built on Guzzle; since v4, Goutte wraps Symfony's HttpClient instead. Guzzle is a general-purpose HTTP client that offers greater flexibility and control over requests and responses. It needs a separate HTML parser for scraping, but it is better suited to complex HTTP interactions and API integrations.

An easy to use, powerful crawler implemented in PHP. Can execute Javascript. (2,559 stars)

Pros of Crawler

  • Built-in support for concurrent requests, improving crawling speed
  • Offers more customization options and flexibility
  • Provides a modern, Laravel-style API

Cons of Crawler

  • Steeper learning curve due to more complex API
  • Requires more setup and configuration for basic crawling tasks
  • Heavier dependency footprint

Code Comparison

Goutte:

$crawler = $client->request('GET', 'https://example.com');
$crawler->filter('.product')->each(function ($node) {
    echo $node->text() . "\n";
});

Crawler:

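use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObservers\CrawlObserver; // namespace used by recent spatie/crawler releases

// Declare the observer before it is instantiated below.
class MyCrawlObserver extends CrawlObserver
{
    public function crawled(UriInterface $url, ResponseInterface $response, ?UriInterface $foundOnUrl = null): void
    {
        $html = (string) $response->getBody();
        // Process the HTML here
    }
}

Crawler::create()
    ->setCrawlObserver(new MyCrawlObserver())
    ->startCrawling('https://example.com');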

Both Goutte and Crawler are PHP web scraping libraries, but they differ in their approach and features. Goutte is simpler and easier to get started with, while Crawler offers more advanced features and customization options. The choice between them depends on the complexity of your scraping needs and your familiarity with Laravel-style coding.

Laravel Dusk provides simple end-to-end testing and browser automation. (1,885 stars)

Pros of Dusk

  • Built-in support for Laravel applications, offering seamless integration
  • Provides a high-level, expressive API for browser automation and testing
  • Includes built-in waiting mechanisms for AJAX and JavaScript interactions

Cons of Dusk

  • Limited to Laravel ecosystem, not suitable for non-Laravel projects
  • Requires more setup and resources due to browser automation

Code Comparison

Goutte (simple HTTP request):

$crawler = $client->request('GET', 'https://example.com');
$crawler->filter('.content')->each(function ($node) {
    print $node->text()."\n";
});

Dusk (browser automation):

$browser->visit('https://example.com')
        ->waitFor('.content')
        ->assertSee('Expected Text')
        ->screenshot('screenshot');

Summary

Goutte is a lightweight, standalone web scraper suitable for simple HTTP requests and parsing. It's framework-agnostic and requires minimal setup. Dusk, on the other hand, is a full-featured browser automation tool specifically designed for Laravel applications. It offers more advanced features like JavaScript interaction and screenshot capture but comes with a heavier setup process and is limited to the Laravel ecosystem.

Twig, the flexible, fast, and secure template language for PHP (8,218 stars)

Pros of Twig

  • Powerful templating engine with extensive features for complex layouts
  • Excellent documentation and large community support
  • Secure by design, with automatic output escaping

Cons of Twig

  • Solves a different problem: it renders templates and does not fetch or parse web pages
  • Adds its own template syntax to learn
  • Not usable for web scraping on its own

Code Comparison

Twig (template rendering):

{% for user in users %}
    <li>{{ user.name }}</li>
{% endfor %}

Goutte (web scraping):

$crawler = $client->request('GET', 'https://example.com');
$crawler->filter('.user-name')->each(function ($node) {
    echo $node->text() . "\n";
});

Summary

Twig is a robust templating engine designed for creating complex web layouts and managing content presentation. It offers powerful features and strong security defaults, but it is not a scraping tool.

Goutte, on the other hand, is specifically designed for web scraping and provides a simpler, more straightforward approach to extracting data from websites. It's easier to set up and use for basic scraping needs but lacks the advanced templating capabilities of Twig.

Choose Twig for complex web development projects requiring advanced templating, and Goutte for straightforward web scraping tasks.

Goutte, a simple PHP Web Scraper (9,263 stars)

Pros of Goutte

  • Lightweight and easy to use web scraping library
  • Integrates well with Symfony components
  • Supports clicking links and submitting forms programmatically

Cons of Goutte

  • Limited functionality compared to more comprehensive scraping frameworks
  • Lacks built-in JavaScript rendering capabilities
  • May require additional libraries for complex scraping tasks

Code Comparison

Both repositories are the same, so there's no code difference to compare. Here's a sample usage of Goutte:

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://example.com');

$crawler->filter('.post-title')->each(function ($node) {
    print $node->text()."\n";
});

Summary

Goutte is a popular PHP web scraping library published under the FriendsOfPHP organization. It provides a simple way to extract data from websites, and its light footprint and ease of use suit many basic scraping tasks. Note that the library is now deprecated: since v4 it is a simple proxy to Symfony's HttpBrowser class, which is the recommended replacement.

The comparison between the two repositories is not applicable in this case, as they are the same project. The repository FriendsOfPHP/Goutte is the official and only repository for the Goutte library.


README

Goutte, a simple PHP Web Scraper

Goutte is a screen scraping and web crawling library for PHP.

Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.

WARNING: This library is deprecated. As of v4, Goutte became a simple proxy to the `HttpBrowser class <https://symfony.com/doc/current/components/browser_kit.html#making-external-http-requests>`_ from the `Symfony BrowserKit <https://symfony.com/browser-kit>`_ component. To migrate, replace ``Goutte\Client`` with ``Symfony\Component\BrowserKit\HttpBrowser`` in your code.
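The migration described in the warning might look like the sketch below. To keep it runnable offline, it swaps the real HTTP client for Symfony's MockHttpClient with a canned response; the URL and HTML are placeholders. In real code you would pass ``HttpClient::create()`` (or no argument at all) instead. It assumes symfony/browser-kit, symfony/http-client, symfony/dom-crawler, and symfony/css-selector are installed.

```php
<?php

require 'vendor/autoload.php';

// Before the migration: use Goutte\Client;
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\MockHttpClient;
use Symfony\Component\HttpClient\Response\MockResponse;

// Canned response so the example does not touch the network.
$httpClient = new MockHttpClient(new MockResponse(
    '<html><body><h1>Hello</h1></body></html>',
    ['response_headers' => ['content-type' => 'text/html; charset=UTF-8']]
));

// Before the migration: $client = new Client($httpClient);
$browser = new HttpBrowser($httpClient);

// The calling code is unchanged: request() still returns a Crawler.
$crawler = $browser->request('GET', 'https://example.com/');
$heading = $crawler->filter('h1')->text();

echo $heading, "\n";
```

Since only the class name changes, the migration is mostly a matter of updating the ``use`` statement and the constructor call.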

Requirements

Goutte depends on PHP 7.1+.

Installation

Add fabpot/goutte as a require dependency in your composer.json file:

.. code-block:: bash

    composer require fabpot/goutte

Usage

Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\HttpBrowser):

.. code-block:: php

    use Goutte\Client;

    $client = new Client();

Make requests with the request() method:

.. code-block:: php

    // Go to the symfony.com website
    $crawler = $client->request('GET', 'https://www.symfony.com/blog/');

The method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).

To use your own HTTP settings, you may create and pass an HttpClient instance to Goutte. For example, to add a 60 second request timeout:

.. code-block:: php

    use Goutte\Client;
    use Symfony\Component\HttpClient\HttpClient;

    $client = new Client(HttpClient::create(['timeout' => 60]));

Click on links:

.. code-block:: php

    // Click on the "Security Advisories" link
    $link = $crawler->selectLink('Security Advisories')->link();
    $crawler = $client->click($link);

Extract data:

.. code-block:: php

    // Get the latest post in this category and display the titles
    $crawler->filter('h2 > a')->each(function ($node) {
        print $node->text()."\n";
    });

Submit forms:

.. code-block:: php

    $crawler = $client->request('GET', 'https://github.com/');
    $crawler = $client->click($crawler->selectLink('Sign in')->link());
    $form = $crawler->selectButton('Sign in')->form();
    $crawler = $client->submit($form, ['login' => 'fabpot', 'password' => 'xxxxxx']);
    $crawler->filter('.flash-error')->each(function ($node) {
        print $node->text()."\n";
    });

More Information

Read the documentation of the BrowserKit_, DomCrawler_, and HttpClient_ Symfony Components for more information about what you can do with Goutte.

Pronunciation

Goutte is pronounced ``goot``, i.e. it rhymes with ``boot`` and not ``out``.

Technical Information

Goutte is a thin wrapper around the following Symfony Components: BrowserKit, CssSelector, DomCrawler, and HttpClient.

License

Goutte is licensed under the MIT license.

.. _Composer: https://getcomposer.org
.. _BrowserKit: https://symfony.com/components/BrowserKit
.. _DomCrawler: https://symfony.com/doc/current/components/dom_crawler.html
.. _CssSelector: https://symfony.com/doc/current/components/css_selector.html
.. _HttpClient: https://symfony.com/doc/current/components/http_client.html