Web scraping has become an essential tool for developers and data analysts who need to extract and analyze information from the web. Whether you're tracking product prices, collecting data for research, or building a customized dashboard, web scraping offers endless possibilities.
If you're a PHP enthusiast, Goutte is a fantastic library to consider for your web scraping needs. Goutte is lightweight, user-friendly, and powerful, combining Guzzle’s HTTP client capabilities with Symfony's DomCrawler for smooth and efficient web scraping.
This guide will take you through the basics of web scraping with PHP using Goutte—from installation and your first script to advanced techniques like form handling and pagination.
Goutte has gained popularity among developers for good reason, making it one of the go-to scraping libraries for PHP.
Whether you're new to PHP or a seasoned developer, Goutte strikes an ideal balance between simplicity and power.
Before jumping into coding, make sure the necessary prerequisites are in place: a working PHP installation and Composer, PHP's dependency manager, both available from your command line.
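You can quickly confirm both are available by running these commands in your terminal (assuming php and composer are on your PATH):
php -v
composer --version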
To install Goutte, simply run the following command in your terminal:
composer require fabpot/goutte
Once installed, verify the library is accessible by requiring Composer’s autoloader in your project:
require 'vendor/autoload.php';
Now you’re ready to start scraping!
Let's begin with a simple example. We'll scrape the title of a webpage using Goutte. Below is the basic script:
<?php
require 'vendor/autoload.php';
use Goutte\Client;
// Initialize Goutte Client
$client = new Client();
// Send a GET request to the target URL
$crawler = $client->request('GET', 'https://books.toscrape.com/');
// Extract the title of the page
$title = $crawler->filter('title')->text();
echo "Page Title: $title\n";
// Extract the titles of the first 5 books
echo "First 5 Book Titles:\n";
$crawler->filter('.product_pod h3 a')->slice(0, 5)->each(function ($node) {
    echo "- " . $node->attr('title') . "\n";
});
?>
Output:
Page Title: All products | Books to Scrape - Sandbox
First 5 Book Titles:
- A Light in the Attic
- Tipping the Velvet
- Soumission
- Sharp Objects
- Sapiens: A Brief History of Humankind
It’s as easy as that! With just a few lines of code, you can fetch and display the title tag of any webpage.
Once you've learned how to fetch a webpage, the next step is extracting specific data such as links or content from specific HTML elements.
The following script extracts the href attributes of all <a> tags on a webpage:
<?php
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com/');
// Extract all <a> tags
$links = $crawler->filter('a')->each(function ($node) {
    return $node->attr('href');
});
// Print all extracted links
foreach ($links as $link) {
    echo $link . "\n";
}
This will return all the hyperlinks present on the page.
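Many of these hrefs are relative. If you need absolute URLs instead, DomCrawler’s link() helper can resolve each anchor against the current page’s URI; the snippet below is a small variation of the script above:
<?php
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com/');
// Resolve each anchor's href against the page URI to get an absolute URL
$links = $crawler->filter('a')->each(function ($node) {
    return $node->link()->getUri();
});
foreach ($links as $link) {
    echo $link . "\n";
}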
Goutte makes it easy to extract or parse data from HTML using class or ID selectors. For this example, we’ll use the Books to Scrape website and scrape information about each book; conveniently, every book shares the same class, product_pod.
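For reference, each book on the page is wrapped in markup roughly like this (simplified from the live site):
<article class="product_pod">
    <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
    <p class="price_color">£51.77</p>
</article>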
Here’s an example of how you can achieve this using Goutte:
<?php
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com/');
// Extract elements with class 'product_pod'
$products = $crawler->filter('.product_pod')->each(function ($node) {
    return $node->text();
});
// Print all extracted product details
foreach ($products as $product) {
    echo $product . "\n";
}
Now, let’s explore how to navigate, or paginate, between pages. On the example page we’re using, a "next" button links to the following page, and we’ll leverage it to implement pagination.
First, we’ll locate the button via its class attribute, whose value is next. Inside this element is an <a> tag containing the URL of the next page. By extracting that URL, we can send a new request and seamlessly move to the next page.
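For reference, the markup of that button looks roughly like this (simplified from the live site):
<li class="next">
    <a href="catalogue/page-2.html">next</a>
</li>
Note that the href is relative, and its base differs between the home page and the catalogue pages, which is why the script below starts directly at catalogue/page-1.html.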
Here’s what the code that achieves this looks like:
<?php
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
// Start inside the catalogue so every relative "next" href shares the same base
$crawler = $client->request('GET', 'https://books.toscrape.com/catalogue/page-1.html');
echo "Currently on: " . $crawler->getUri() . "\n";
// Handle pagination using the 'next' button
while ($crawler->filter('li.next a')->count() > 0) {
    $nextLink = $crawler->filter('li.next a')->attr('href');
    $crawler = $client->request('GET', 'https://books.toscrape.com/catalogue/' . $nextLink);
    // Print the URL of the page we just landed on
    echo "Currently on: " . $crawler->getUri() . "\n";
}
With this approach, you can automate the navigation between pages and continue scraping data.
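Putting pagination and extraction together, here’s a sketch that walks every catalogue page and collects the book titles as it goes. The one-second pause is simply a polite default; see the best-practices notes further below:
<?php
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
// Start inside the catalogue so every relative "next" href shares the same base
$crawler = $client->request('GET', 'https://books.toscrape.com/catalogue/page-1.html');
$allTitles = [];
while (true) {
    // Collect the book titles on the current page
    $crawler->filter('.product_pod h3 a')->each(function ($node) use (&$allTitles) {
        $allTitles[] = $node->attr('title');
    });
    // Stop when there is no "next" button left
    if ($crawler->filter('li.next a')->count() === 0) {
        break;
    }
    $nextLink = $crawler->filter('li.next a')->attr('href');
    $crawler = $client->request('GET', 'https://books.toscrape.com/catalogue/' . $nextLink);
    sleep(1); // Be polite: pause briefly between requests
}
echo "Collected " . count($allTitles) . " book titles in total\n";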
Goutte is also capable of handling forms. To demonstrate this functionality, we’ll use the forms page on Scrape This Site (https://www.scrapethissite.com/pages/forms/), which offers a single search input field.
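The form itself boils down to roughly the following markup (simplified; the q field name and the Search button label are what the script below relies on):
<form method="get">
    <input type="text" name="q">
    <button type="submit">Search</button>
</form>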
Here’s what the code for submitting this form looks like:
<?php
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'https://www.scrapethissite.com/pages/forms/');
// Submit the search form with a query
$form = $crawler->selectButton('Search')->form();
$form['q'] = 'Canada';
$crawler = $client->submit($form);
// Extract and print the results
$results = $crawler->filter('.team')->each(function ($node) {
    return $node->text();
});
foreach ($results as $result) {
    echo $result . "\n";
}
This script fills out the form field named q with the value Canada and submits it. From here, you can extract content from the search results page just like in the earlier examples.
Always add error handling to manage unexpected situations like a failed network connection or non-existent URLs.
<?php
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
try {
    $crawler = $client->request('GET', 'https://invalid-url-example.com');
    echo "Page title: " . $crawler->filter('title')->text();
} catch (Exception $e) {
    echo "An error occurred: " . $e->getMessage();
}
Web scraping should always be performed ethically and responsibly. The `robots.txt` file is a simple text file used by websites to communicate with web crawlers, outlining which parts of the site can or cannot be accessed. Before scraping, it's important to check the `robots.txt` file to ensure you're following the site's rules and respecting their terms. Ignoring these guidelines can lead to legal and ethical issues, so always make this step a priority in your scraping process.
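If you want the check to be part of your workflow, a very simple starting point is to fetch the file and read it before crawling. This is only a sketch; real projects may want a dedicated robots.txt parser:
<?php
// Minimal sketch: fetch a site's robots.txt before deciding what to crawl
$robots = @file_get_contents('https://books.toscrape.com/robots.txt');
if ($robots === false) {
    echo "Could not fetch robots.txt - proceed with caution.\n";
} else {
    echo $robots . "\n";
}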
Be courteous and avoid sending too many requests in a short period of time, as this can overwhelm the server and disrupt its performance for other users. It's a good practice to include a short delay between each request to minimize the load on the server and ensure it can handle traffic efficiently. Taking these steps not only helps maintain server stability but also demonstrates responsible and considerate usage of shared resources.
sleep(1); // Wait 1 second between requests
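A fixed delay works well for small jobs; some scrapers also add a little jitter so requests don't land at perfectly regular intervals (purely optional):
usleep(random_int(500000, 1500000)); // Wait between 0.5 and 1.5 seconds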
Web scraping is a powerful tool for gathering data efficiently, but it requires a responsible and thoughtful approach to avoid common pitfalls and ensure ethical usage. By adhering to best practices such as respecting website terms of service, implementing appropriate delays between requests, and using tools capable of handling dynamic content, you can create a scraper that performs effectively while minimizing impact on servers. Additionally, verifying HTTPS certificates and staying mindful of security considerations will protect your scraper and any data it collects. With proper planning and execution, web scraping can become an invaluable resource for research, analysis, and innovation.