Top Related Projects
Simulates the behavior of a web browser, allowing you to make requests, click on links and submit forms programmatically
Guzzle, an extensible PHP HTTP client
An easy to use, powerful crawler implemented in PHP. Can execute Javascript.
Laravel Dusk provides simple end-to-end testing and browser automation.
Twig, the flexible, fast, and secure template language for PHP
Goutte, a simple PHP Web Scraper
Quick Overview
Goutte is a PHP web scraping library that provides a simple yet powerful interface for extracting data from websites. It allows developers to programmatically navigate web pages, fill out forms, and extract content using CSS selectors or XPath expressions.
Pros
- Easy to use with a clean and intuitive API
- Few moving parts: a thin wrapper around Symfony's BrowserKit, CssSelector, DomCrawler, and HttpClient components
- Supports both GET and POST requests, as well as form submission
- Integrates well with other PHP libraries and frameworks
Cons
- No JavaScript support: it does not execute JavaScript on pages
- May not handle complex, dynamic websites as effectively as browser-based scrapers
- Lacks built-in features for handling CAPTCHAs or IP rotation
- Deprecated: as of v4 it is a simple proxy to Symfony BrowserKit's HttpBrowser class, which replaces it
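The JavaScript limitation can be illustrated without Goutte at all. The sketch below uses PHP's built-in DOM extension as a stand-in for any non-browser HTML parser (an assumption made to keep the example dependency-free; the markup and the `countProducts` helper are hypothetical). Content that a script would inject at runtime never appears in the parsed tree.

```php
<?php
// Hypothetical page: one product is in the static HTML, a second one
// would only be added by the inline script when run in a real browser.
$html = <<<'HTML'
<html><body>
  <ul id="list"><li class="product">Static item</li></ul>
  <script>
    var li = document.createElement('li');
    li.className = 'product';
    li.textContent = 'Injected item';
    document.getElementById('list').appendChild(li);
  </script>
</body></html>
HTML;

// Count elements carrying the "product" class in the parsed static markup.
function countProducts(string $html): int
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect real-world markup.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' product ')]"
    );
    return $nodes->length;
}

// Only the static item is counted, because the script is never executed.
echo countProducts($html), "\n";
```

For pages that build their content this way, a browser-based tool is needed instead of a plain HTTP scraper.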
Code Examples
- Basic usage to scrape a webpage:
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'https://example.com');
$title = $crawler->filter('h1')->text();
echo $title;
- Submitting a form:
$crawler = $client->request('GET', 'https://example.com/form');
$form = $crawler->selectButton('Submit')->form();
$crawler = $client->submit($form, ['username' => 'john', 'password' => 'secret']);
- Extracting multiple elements:
$crawler->filter('.product')->each(function ($node) {
    echo $node->filter('.name')->text() . "\n";
    echo $node->filter('.price')->text() . "\n";
});
Getting Started
- Install Goutte using Composer:
composer require fabpot/goutte
- Create a new PHP file and add the following code:
<?php
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'https://example.com');
// Use $crawler to extract data from the page
$title = $crawler->filter('title')->text();
echo "Page title: " . $title;
- Run the PHP file to see the results.
Competitor Comparisons
Simulates the behavior of a web browser, allowing you to make requests, click on links and submit forms programmatically
Pros of browser-kit
- More lightweight and focused on core functionality
- Better integration with other Symfony components
- More frequent updates and active maintenance
Cons of browser-kit
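- Ships without an HTTP client: external requests need an HttpClient implementation such as symfony/http-client installed alongside it
- CSS selector support in filter() requires the separate symfony/css-selector package
- Existing Goutte code must be updated to use the HttpBrowser class name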
Code Comparison
browser-kit:
use Symfony\Component\BrowserKit\HttpBrowser;
$browser = new HttpBrowser();
$crawler = $browser->request('GET', 'https://example.com');
$content = $crawler->filter('.content')->text();
Goutte:
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'https://example.com');
$content = $crawler->filter('.content')->text();
Both libraries expose the same request and crawling API. As of v4, Goutte\Client is a simple proxy to browser-kit's HttpBrowser class, so the two snippets above behave identically and Goutte adds no features of its own. New code should use HttpBrowser directly; Goutte mainly remains relevant for existing projects that have not yet migrated.
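Migrating from Goutte to HttpBrowser is mostly a matter of swapping the class name. A minimal sketch, assuming Composer's autoloader is already loaded and that symfony/browser-kit, symfony/http-client, and symfony/css-selector are installed; the `fetchHeading` helper and its default selector are hypothetical:

```php
<?php
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

// Hypothetical helper: fetch a page and return the text of the first match.
function fetchHeading(string $url, string $selector = 'h1'): string
{
    // Previously: $client = new \Goutte\Client();
    $browser = new HttpBrowser(HttpClient::create(['timeout' => 60]));

    // request() returns a Symfony\Component\DomCrawler\Crawler, as before.
    $crawler = $browser->request('GET', $url);

    // text() throws if the selector matches nothing, so callers may want
    // to check $crawler->filter($selector)->count() first.
    return $crawler->filter($selector)->text();
}
```

A call such as `echo fetchHeading('https://example.com');` would then replace the equivalent Goutte code.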
Guzzle, an extensible PHP HTTP client
Pros of Guzzle
- More feature-rich and flexible HTTP client library
- Supports both synchronous and asynchronous requests
- Extensive middleware system for customizing request/response handling
Cons of Guzzle
- Steeper learning curve due to more complex API
- Requires more setup and configuration for basic scraping tasks
- May be overkill for simple web scraping projects
Code Comparison
Goutte:
$client = new Goutte\Client();
$crawler = $client->request('GET', 'https://example.com');
$title = $crawler->filter('h1')->text();
Guzzle:
$client = new GuzzleHttp\Client();
$response = $client->request('GET', 'https://example.com');
// getBody() returns a stream; cast it to obtain the HTML as a string
$body = (string) $response->getBody();
// Additional parsing required to extract data
Goutte releases before v4 were built on top of Guzzle; from v4 on, Goutte wraps Symfony's HttpClient instead. In both cases it adds a DOM crawler and selector system on top of the HTTP layer, which makes it easier to extract data from HTML responses. Guzzle is a general-purpose HTTP client that offers more control over requests and responses. It needs an extra parsing step for web scraping, but it is better suited to complex HTTP interactions and API integrations.
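The "additional parsing" step in the Guzzle snippet can be sketched with PHP's built-in DOM extension. This assumes the response body has already been obtained as a string; the sample markup and the `extractFirstHeading` helper are hypothetical, and Symfony's DomCrawler component is an alternative that offers the same selector API as Goutte.

```php
<?php
// Return the trimmed text of the first <h1>, or null when there is none.
function extractFirstHeading(string $html): ?string
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // tolerate imperfect real-world markup
    $nodes = (new DOMXPath($doc))->query('//h1');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Stand-in for the string body of a Guzzle response.
$body = '<html><body><h1> Example Domain </h1><p>Text</p></body></html>';

echo extractFirstHeading($body), "\n";
```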
An easy to use, powerful crawler implemented in PHP. Can execute Javascript.
Pros of Crawler
- Built-in support for concurrent requests, improving crawling speed
- Offers more customization options and flexibility
- Provides a modern, Laravel-style API
Cons of Crawler
- Steeper learning curve due to more complex API
- Requires more setup and configuration for basic crawling tasks
- Heavier dependency footprint
Code Comparison
Goutte:
$crawler = $client->request('GET', 'https://example.com');
$crawler->filter('.product')->each(function ($node) {
    echo $node->text() . "\n";
});
Crawler:
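// Namespaces assume a recent spatie/crawler release; observer method
// signatures vary between versions.
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObservers\CrawlObserver;
// Declare the observer before it is instantiated below.
class MyCrawlObserver extends CrawlObserver
{
    public function crawled(UriInterface $url, ResponseInterface $response, ?UriInterface $foundOnUrl = null): void
    {
        $html = (string) $response->getBody();
        // Process the HTML here
    }
}
Crawler::create()
    ->setCrawlObserver(new MyCrawlObserver)
    ->startCrawling('https://example.com');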
Both Goutte and Crawler are PHP web scraping libraries, but they differ in their approach and features. Goutte is simpler and easier to get started with, while Crawler offers more advanced features and customization options. The choice between them depends on the complexity of your scraping needs and your familiarity with Laravel-style coding.
Laravel Dusk provides simple end-to-end testing and browser automation.
Pros of Dusk
- Built-in support for Laravel applications, offering seamless integration
- Provides a high-level, expressive API for browser automation and testing
- Includes built-in waiting mechanisms for AJAX and JavaScript interactions
Cons of Dusk
- Limited to Laravel ecosystem, not suitable for non-Laravel projects
- Requires more setup and resources due to browser automation
Code Comparison
Goutte (simple HTTP request):
$crawler = $client->request('GET', 'https://example.com');
$crawler->filter('.content')->each(function ($node) {
    print $node->text()."\n";
});
Dusk (browser automation):
$browser->visit('https://example.com')
    ->waitFor('.content')
    ->assertSee('Expected Text')
    ->screenshot('screenshot');
Summary
Goutte is a lightweight, standalone web scraper suitable for simple HTTP requests and parsing. It's framework-agnostic and requires minimal setup. Dusk, on the other hand, is a full-featured browser automation tool specifically designed for Laravel applications. It offers more advanced features like JavaScript interaction and screenshot capture but comes with a heavier setup process and is limited to the Laravel ecosystem.
Twig, the flexible, fast, and secure template language for PHP
Pros of Twig
- Powerful templating engine with extensive features for complex layouts
- Excellent documentation and large community support
- Secure by design, with automatic output escaping
Cons of Twig
- Solves a different problem: it renders templates and does not fetch or parse web pages
- Cannot replace Goutte for scraping; the two are complementary rather than competing
- Adds a template syntax to learn, which is unnecessary if the goal is only data extraction
Code Comparison
Twig (template rendering):
{% for user in users %}
    <li>{{ user.name }}</li>
{% endfor %}
Goutte (web scraping):
$crawler = $client->request('GET', 'https://example.com');
$crawler->filter('.user-name')->each(function ($node) {
    echo $node->text() . "\n";
});
Summary
Twig is a robust templating engine designed for creating complex web layouts and managing content presentation. It offers powerful features and excellent security, but it is not a scraping tool and does not fetch or parse pages.
Goutte, on the other hand, is specifically designed for web scraping and provides a simpler, more straightforward approach to extracting data from websites. It's easier to set up and use for basic scraping needs but lacks the advanced templating capabilities of Twig.
Choose Twig for complex web development projects requiring advanced templating, and Goutte for straightforward web scraping tasks.
Goutte, a simple PHP Web Scraper
Pros of Goutte
- Lightweight and easy to use web scraping library
- Integrates well with Symfony components
- Follows links and submits forms like a browser session
Cons of Goutte
- Limited functionality compared to more comprehensive scraping frameworks
- Lacks built-in JavaScript rendering capabilities
- May require additional libraries for complex scraping tasks
Code Comparison
Both repositories are the same, so there's no code difference to compare. Here's a sample usage of Goutte:
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'https://example.com');
$crawler->filter('.post-title')->each(function ($node) {
    print $node->text()."\n";
});
Summary
Goutte is a popular PHP web scraping library hosted under the FriendsOfPHP organization. It provides a simple and efficient way to extract data from websites. Although it lacks some advanced features found in more comprehensive scraping frameworks, its small footprint and ease of use suit many basic scraping tasks. The library is now deprecated in favor of Symfony's HttpBrowser class.
The comparison between the two repositories is not applicable in this case, as they are the same project. The repository FriendsOfPHP/Goutte is the official and only repository for the Goutte library.
README
Goutte, a simple PHP Web Scraper
Goutte is a screen scraping and web crawling library for PHP.
Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.
WARNING: This library is deprecated. As of v4, Goutte became a simple proxy
to the `HttpBrowser class <https://symfony.com/doc/current/components/browser_kit.html#making-external-http-requests>`_
from the `Symfony BrowserKit <https://symfony.com/browser-kit>`_ component. To
migrate, replace ``Goutte\Client`` by ``Symfony\Component\BrowserKit\HttpBrowser`` in your code.
Requirements
Goutte depends on PHP 7.1+.
Installation
Add ``fabpot/goutte`` as a require dependency in your ``composer.json`` file:

.. code-block:: bash

    composer require fabpot/goutte

Usage
Create a Goutte Client instance (which extends
``Symfony\Component\BrowserKit\HttpBrowser``):

.. code-block:: php

    use Goutte\Client;

    $client = new Client();

Make requests with the ``request()`` method:

.. code-block:: php

    // Go to the symfony.com website
    $crawler = $client->request('GET', 'https://www.symfony.com/blog/');

The method returns a ``Crawler`` object
(``Symfony\Component\DomCrawler\Crawler``).

To use your own HTTP settings, you may create and pass an HttpClient instance to Goutte. For example, to add a 60 second request timeout:

.. code-block:: php

    use Goutte\Client;
    use Symfony\Component\HttpClient\HttpClient;

    $client = new Client(HttpClient::create(['timeout' => 60]));

Click on links:

.. code-block:: php

    // Click on the "Security Advisories" link
    $link = $crawler->selectLink('Security Advisories')->link();
    $crawler = $client->click($link);

Extract data:

.. code-block:: php

    // Get the latest post in this category and display the titles
    $crawler->filter('h2 > a')->each(function ($node) {
        print $node->text()."\n";
    });

Submit forms:

.. code-block:: php

    $crawler = $client->request('GET', 'https://github.com/');
    $crawler = $client->click($crawler->selectLink('Sign in')->link());
    $form = $crawler->selectButton('Sign in')->form();
    $crawler = $client->submit($form, ['login' => 'fabpot', 'password' => 'xxxxxx']);
    $crawler->filter('.flash-error')->each(function ($node) {
        print $node->text()."\n";
    });

More Information
Read the documentation of the `BrowserKit`_, `DomCrawler`_, and `HttpClient`_
Symfony Components for more information about what you can do with Goutte.
Pronunciation
Goutte is pronounced ``goot`` i.e. it rhymes with ``boot`` and not ``out``.
Technical Information
Goutte is a thin wrapper around the following Symfony Components:
`BrowserKit`_, `CssSelector`_, `DomCrawler`_, and `HttpClient`_.
License
Goutte is licensed under the MIT license.

.. _`Composer`: https://getcomposer.org
.. _`BrowserKit`: https://symfony.com/components/BrowserKit
.. _`DomCrawler`: https://symfony.com/doc/current/components/dom_crawler.html
.. _`CssSelector`: https://symfony.com/doc/current/components/css_selector.html
.. _`HttpClient`: https://symfony.com/doc/current/components/http_client.html