Web scraping relies on the HTML structure of a page and thus can never be completely stable. When the HTML structure changes, the scraper may break. Keep this in mind while reading: by the time you read this article, the CSS selectors used here may already be outdated.
What is Web Scraping?
Have you ever needed to grab some data from a site that doesn’t provide a public API? To solve this problem we can use web scraping and pull the required information out of the HTML. Of course, we could extract the required data from a website manually, but that process quickly becomes tedious. It is more efficient to automate it with a scraper.
In this tutorial we are going to scrape cat images from Pexels. This website provides high-quality and completely free stock photos. It has a public API, but the API is limited to 200 requests per hour.
Making concurrent requests
The main advantage of using asynchronous PHP for web scraping is that we can get a lot of work done in less time. Instead of querying each web page one by one and waiting for responses, we can request as many pages as we want at once and start processing the results as soon as they arrive.
Let’s start with pulling an asynchronous HTTP client called buzz-react – a simple, async HTTP client for concurrently processing any number of HTTP requests, built on top of ReactPHP:
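The client is available via Composer:

```bash
composer require clue/buzz-react
```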
Now we are ready; let’s request an image page on Pexels:
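A minimal sketch of such a request (the photo URL below is a placeholder):

```php
<?php

require 'vendor/autoload.php';

use Clue\React\Buzz\Browser;
use Psr\Http\Message\ResponseInterface;

$loop = React\EventLoop\Factory::create();
$client = new Browser($loop);

// request a single Pexels image page (placeholder URL)
$client->get('https://www.pexels.com/photo/kitten-cat-rush-lucky-cat-45170/')
    ->then(function (ResponseInterface $response) {
        // the promise resolves with a PSR-7 response object
        echo $response->getBody() . PHP_EOL;
    });

$loop->run();
```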
We have created an instance of `Clue\React\Buzz\Browser` and used it as an HTTP client. The code above makes an asynchronous `GET` request to a web page with an image of kittens. Method `$client->get($url)` returns a promise that resolves with a PSR-7 response object.
The client works asynchronously, which means that we can easily request several pages, and these requests will be performed concurrently:
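For example, the following sketch requests two image pages at once (the URLs are placeholders):

```php
$client->get('https://www.pexels.com/photo/adorable-animal-baby-blur-177809/')
    ->then(function (ResponseInterface $response) {
        echo 'First page loaded' . PHP_EOL;
    });

$client->get('https://www.pexels.com/photo/kitten-cat-rush-lucky-cat-45170/')
    ->then(function (ResponseInterface $response) {
        echo 'Second page loaded' . PHP_EOL;
    });
```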
The idea here is the following:
- make a request
- get a promise
- add a handler to a promise
- once the promise is resolved, process the response
So, this logic can be extracted into a class; then we can easily request many URLs and attach the same response handler to them. Let’s create a wrapper over the `Browser`.
Create a class called `Scraper` with the following content:
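A minimal sketch of such a wrapper, reconstructed from the description below:

```php
<?php

use Clue\React\Buzz\Browser;
use Psr\Http\Message\ResponseInterface;

final class Scraper
{
    private $client;

    public function __construct(Browser $client)
    {
        $this->client = $client;
    }

    public function scrape(array $urls)
    {
        foreach ($urls as $url) {
            // request each URL and attach the same response handler
            $this->client->get($url)->then(
                function (ResponseInterface $response) {
                    $this->processResponse((string) $response->getBody());
                }
            );
        }
    }

    private function processResponse(string $html)
    {
        // will traverse the HTML and download images (implemented below)
    }
}
```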
We inject `Browser` as a constructor dependency and provide one public method, `scrape(array $urls)`. For each specified URL we make a `GET` request. Once the response arrives we call a private method `processResponse(string $html)` with the body of the response. This method will be responsible for traversing the HTML and downloading images. The next step is to inspect the received HTML and extract the images from it.
Crawling the website
At this moment we are getting only the HTML code of the requested page. Now we need to extract the image URL. For this, we need to examine the structure of the received HTML. Go to an image page on Pexels, right-click on the image and select Inspect Element; you will see something like this:
We can see that the `img` tag has the class `image-section__image`. We are going to use this information to extract this tag from the received HTML. The URL of the image is stored in the `src` attribute:
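The relevant markup looks roughly like this (reconstructed for illustration; the URL is one of those used later in this article):

```html
<img class="image-section__image"
     src="https://images.pexels.com/photos/4602/jumping-cute-playing-animals.jpg?auto=compress&cs=tinysrgb&h=650&w=940">
```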
For extracting HTML tags we are going to use Symfony DomCrawler Component. Pull the required packages:
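Both packages are available via Composer:

```bash
composer require symfony/dom-crawler symfony/css-selector
```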
The CSS selector component for DomCrawler allows us to use jQuery-like selectors to traverse the DOM. Once everything is installed, open our `Scraper` class and let’s write some code in the `processResponse(string $html)` method. First of all, we need to create an instance of the `Symfony\Component\DomCrawler\Crawler` class; its constructor accepts a string that contains the HTML code to traverse:
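Inside the `Scraper` class that starts like this (a sketch):

```php
// at the top of the file:
// use Symfony\Component\DomCrawler\Crawler;

private function processResponse(string $html)
{
    $crawler = new Crawler($html);
}
```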
To find an element by its jQuery-like selector, use the `filter()` method. Then the `attr($attribute)` method allows us to extract an attribute of the filtered element:
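With the selector and attribute we found above, that becomes:

```php
$imageUrl = $crawler->filter('.image-section__image')->attr('src');
```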
Let’s just print the extracted image URL and check that our scraper works as expected:
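At this point the method could look like this (a sketch):

```php
private function processResponse(string $html)
{
    $crawler = new Crawler($html);
    $imageUrl = $crawler->filter('.image-section__image')->attr('src');

    echo $imageUrl . PHP_EOL;
}
```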
When running this script it will output the full URL of the required image. We can then use this URL to download the image. Again we use an instance of the `Browser` and make a `GET` request:
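Inside the response handler that becomes something like (a sketch):

```php
$this->client->get($imageUrl)->then(
    function (ResponseInterface $response) {
        // $response->getBody() now contains the raw image data
    }
);
```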
The response arrives with the contents of the requested image. Now we need to save it to disk. But hold on, don’t reach for `file_put_contents()`. All native PHP functions that work with the file system are blocking: once you call `file_put_contents()`, our application stops behaving asynchronously, and the flow of control is blocked until the file is saved. ReactPHP has a dedicated package to solve this problem.
Saving files asynchronously
To process files asynchronously in a non-blocking way we need a package called reactphp/filesystem. Go ahead and pull it:
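The package is installed via Composer:

```bash
composer require react/filesystem
```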
To start working with the file system, create an instance of the `Filesystem` object and provide it as a dependency to our `Scraper`. Also, we need to provide a directory where all the downloaded images will be put:
Here is an updated constructor of the `Scraper`:
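A sketch of the updated class skeleton (the `FilesystemInterface` type hint is an assumption):

```php
use Clue\React\Buzz\Browser;
use React\Filesystem\FilesystemInterface;

final class Scraper
{
    private $client;
    private $filesystem;
    private $directory;

    public function __construct(Browser $client, FilesystemInterface $filesystem, string $directory)
    {
        $this->client = $client;
        $this->filesystem = $filesystem;
        $this->directory = $directory;
    }

    // ... scrape() and processResponse() stay as before
}
```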
Ok, now we are ready to save files on disk. First of all, we need to extract a filename from the URL. The scraped URLs to the images look like this:
https://images.pexels.com/photos/4602/jumping-cute-playing-animals.jpg?auto=compress&cs=tinysrgb&h=650&w=940
https://images.pexels.com/photos/617278/pexels-photo-617278.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
And filenames for these URLs will be the following:
jumping-cute-playing-animals.jpg
pexels-photo-617278.jpeg
Let’s use a regular expression to extract filenames out of the URLs. To get a full path to a future file on disk we concatenate these names with a directory:
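One possible way to do this (the exact regular expression is an assumption for illustration):

```php
// extract something like "jumping-cute-playing-animals.jpg" from the URL
preg_match('/\/([\w-]+\.(?:jpe?g|png))\?/i', $imageUrl, $matches);
$filePath = $this->directory . DIRECTORY_SEPARATOR . $matches[1];
```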
Once we have a path to a file we can use it to create a file object:
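With reactphp/filesystem that is:

```php
$file = $this->filesystem->file($filePath);
```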
This object represents the file we are going to work with. Then call the `putContents($contents)` method and provide the response body as a string:
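In our case the contents is the body of the image response:

```php
$file->putContents((string) $response->getBody());
```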
That’s it. All the asynchronous low-level magic is hidden behind one simple method. Under the hood, it creates a stream in writing mode, writes the data to it and then closes the stream. Here is an updated version of the `Scraper::processResponse(string $html)` method:
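Putting the pieces together, the method might look like this (a sketch assembled from the steps above):

```php
private function processResponse(string $html)
{
    $crawler = new Crawler($html);
    $imageUrl = $crawler->filter('.image-section__image')->attr('src');

    preg_match('/\/([\w-]+\.(?:jpe?g|png))\?/i', $imageUrl, $matches);
    $filePath = $this->directory . DIRECTORY_SEPARATOR . $matches[1];

    $this->client->get($imageUrl)->then(
        function (ResponseInterface $response) use ($filePath) {
            $this->filesystem->file($filePath)->putContents((string) $response->getBody());
        }
    );
}
```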
We pass a full path to a file inside the response handler. Then, we create a file and fill it with the response body. Actually, the whole scraper is less than 50 lines of code!
Note: create the directory where you want to store the files beforehand. The `putContents()` method only creates the file; it doesn’t create missing folders in the specified path.
The scraper is done. Now, open your main script and pass a list of URLs to scrape:
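A sketch of the main script (the image page URLs are placeholders; the `Scraper` class from above is assumed to be autoloadable):

```php
<?php

require 'vendor/autoload.php';

use Clue\React\Buzz\Browser;
use React\Filesystem\Filesystem;

$loop = React\EventLoop\Factory::create();

$scraper = new Scraper(
    new Browser($loop),
    Filesystem::create($loop),
    __DIR__ . '/images'
);

$scraper->scrape([
    'https://www.pexels.com/photo/adorable-animal-baby-blur-177809/',
    'https://www.pexels.com/photo/kitten-cat-rush-lucky-cat-45170/',
    // ... three more image pages
]);

$loop->run();
```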
The snippet above scrapes five URLs and downloads the corresponding images. And all of this is done quickly and asynchronously.
Conclusion
In the previous tutorial, we used ReactPHP to speed up web scraping and to query web pages concurrently. But what if we also need to save files concurrently? In an asynchronous application we cannot use native PHP functions like `file_put_contents()`, because they block the flow of control, so there would be no speed increase when storing images on disk. To process files asynchronously in a non-blocking way in ReactPHP, we need the reactphp/filesystem package.
So, in around 50 lines of code, we were able to get a web scraper up and running. This was just a tiny example of what you could do. Now that you have the basic knowledge of how to build a scraper, go and try building your own!
I have several more articles on web scraping with ReactPHP: check them out if you want to use a proxy or limit the number of concurrent requests.
You can find examples from this article on GitHub.
This article is a part of the ReactPHP Series.
We’d like to continue the sequence of our posts about the Top 5 Popular Libraries for Web Scraping in 2020 with a new programming language: JavaScript.
JS is a well-known language with wide adoption and strong community support. It can be used for both client-side and server-side scraping scripts, which makes it well suited for writing your scrapers and crawlers.
Most of these libraries’ advantages can be obtained by using our API, and some of these libraries can be used in a stack with it.
So let’s check them out.
The 5 Top JavaScript Web Scraping Libraries in 2020
1. Axios
Axios is a promise-based HTTP client for the browser and Node.js. But why exactly this library? There are a lot of libraries that could be used instead of the well-known request: got, superagent, node-fetch. But Axios is a suitable solution not only for Node.js but for client-side usage too.
Simplicity of usage is shown below:
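A minimal sketch of fetching a page with Axios (the URL is a placeholder):

```javascript
const axios = require('axios');

axios.get('https://example.com')
  .then(response => {
    // response.status holds the HTTP status code, response.data the page body
    console.log(response.status);
    console.log(response.data.slice(0, 200));
  })
  .catch(error => console.error(error.message));
```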
Promises are cool, aren’t they?
To get this library you can use one of the following package managers: npm, bower, or yarn. The install commands are shown below.
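These are the standard install commands for Axios:

```bash
# with npm
npm install axios

# with bower
bower install axios

# with yarn
yarn add axios
```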
GitHub repository: https://github.com/axios/axios
2. Cheerio
Cheerio implements a subset of core jQuery. In simple words, you can just swap your jQuery environment for Cheerio when web scraping. And guess what? It has the same benefit that Axios has: you can use it from the client and from Node.js as well.
For the sample of usage, you can check another of our articles: Amazon Scraping. Relatively easy.
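In the meantime, here is a minimal sketch of loading a markup string and querying it with jQuery-like selectors (the markup and class names are made up for illustration):

```javascript
const cheerio = require('cheerio');

const html = '<ul><li class="cat">Tabby</li><li class="cat">Siamese</li></ul>';
const $ = cheerio.load(html);

$('.cat').each((i, el) => {
  console.log($(el).text()); // "Tabby", then "Siamese"
});
```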
Also, check out the docs:
- Official docs URL: https://cheerio.js.org/
- GitHub repository: https://github.com/cheeriojs/cheerio
3. Selenium
Selenium is a popular WebDriver that has wrappers for most programming languages. Quality assurance engineers, automation specialists, developers, data scientists: all of them have used this tool at least once. For web scraping it’s like a Swiss Army knife, no additional libraries needed. Any action can be performed with a browser as a real user would: opening a page, clicking a button, filling in a form, solving a CAPTCHA and much more.
Selenium may be installed via npm with:
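The JavaScript bindings live in the selenium-webdriver package:

```bash
npm install selenium-webdriver
```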
And the usage is simple too:
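A minimal sketch with the selenium-webdriver bindings (assumes a Chrome driver is installed on the machine; the URL is a placeholder):

```javascript
const { Builder, By } = require('selenium-webdriver');

(async () => {
  // start a Chrome session (chromedriver must be available separately)
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get('https://example.com');
    const heading = await driver.findElement(By.css('h1')).getText();
    console.log(heading);
  } finally {
    await driver.quit();
  }
})();
```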
More info can be found via the documentation:
- Official docs URL: https://selenium-python.readthedocs.io/
- GitHub repository: https://github.com/SeleniumHQ/selenium
4. Puppeteer
There are a lot of things we can say about Puppeteer: it’s a reliable and production-ready library with great community support. Basically, Puppeteer is a Node.js library that offers a simple and efficient API and enables you to control Google’s Chrome or Chromium browser. So you can run a particular site’s JavaScript (just as with Selenium) and scrape single-page applications based on Vue.js, React.js, Angular, etc.
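For a quick feel of the API, a minimal sketch (the URL is a placeholder):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // the page's own JavaScript has been executed, so rendered content is available
  console.log(await page.title());

  await browser.close();
})();
```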
We have a great example of using Puppeteer for scraping an Angular-based website; you can check it here: AngularJS site scraping. Easy deal?
Also, we’d like to suggest you check out a great curated list of awesome Puppeteer resources: https://github.com/transitive-bullshit/awesome-puppeteer
As well, useful official resources:
- Official docs URL: https://developers.google.com/web/tools/puppeteer
- GitHub repository: https://github.com/GoogleChrome/puppeteer
5. Playwright
Not as well-known a library as Puppeteer, but it can be called Puppeteer 2, since Playwright is maintained by former Puppeteer contributors. Unlike Puppeteer, it supports Chrome, Chromium, WebKit and Firefox backends.
To install it just run the following command:
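Playwright is distributed as a single npm package:

```bash
npm i playwright
```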
To see that the API is pretty much the same, you can take a look at the example below:
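A minimal Playwright sketch (the URL is a placeholder):

```javascript
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  console.log(await page.title());

  await browser.close();
})();
```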
- Official docs URL: https://github.com/microsoft/playwright/blob/master/docs/README.md
- GitHub repository: https://github.com/microsoft/playwright
Conclusion
It’s always up to you to decide what to use for your particular web scraping case, but it’s also pretty obvious that the amount of data on the Internet is increasing exponentially and data mining is becoming a crucial instrument for your business’s growth.
But remember: instead of choosing a fancy tool that may not be of much use, you should focus on finding the tool that suits your requirements best.