Webscraping Course

Some websites can contain a very large amount of invaluable data.

Stock prices, product details, sports stats, company contacts, you name it.

This course covers getting started with web scraping in Python. Web scraping is an incredibly useful tool to have in your data scientist’s armoury, and this course will get you started on the right footing. The tutorial below is based on part of our interactive course on APIs and web scraping in Python, which you can start for free. We assume that you know some of the fundamentals of working with data in Python.

If you wanted to access this information, you’d either have to use whatever format the website uses or copy-paste the information manually into a new document. Here’s where web scraping can help.

What is Web Scraping?

Web scraping refers to the extraction of data from a website. This information is collected and then exported into a format that is more useful for the user, be it a spreadsheet or an API.

Although web scraping can be done manually, in most cases, automated tools are preferred when scraping web data as they can be less costly and work at a faster rate.

But in most cases, web scraping is not a simple task. Websites come in many shapes and forms; as a result, web scrapers vary in functionality and features.

If you want to find the best web scraper for your project, make sure to read on.

How do Web Scrapers Work?

Automated web scrapers work in a way that is simple in principle but complex in practice. After all, websites are built for humans to understand, not machines.

First, the web scraper will be given one or more URLs to load before scraping. The scraper then loads the entire HTML code for the page in question. More advanced scrapers will render the entire website, including CSS and JavaScript elements.

Then the scraper will either extract all the data on the page or specific data selected by the user before the project is run.

Ideally, the user will go through the process of selecting the specific data they want from the page. For example, you might want to scrape an Amazon product page for prices and models but are not necessarily interested in product reviews.

Lastly, the web scraper will output all the data that has been collected into a format that is more useful to the user.

Most web scrapers will output data to a CSV or Excel spreadsheet, while more advanced scrapers will support other formats such as JSON which can be used for an API.
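
To make this load–extract–export cycle concrete, here is a minimal sketch in Python. The URL, tag names, and output filename are placeholders, and it assumes the third-party requests and beautifulsoup4 packages are installed.

    import csv
    import requests                    # third-party HTTP client (assumed installed)
    from bs4 import BeautifulSoup      # third-party HTML parser (assumed installed)

    # 1. Load the HTML for the page in question
    response = requests.get("https://example.com/products")        # placeholder URL
    soup = BeautifulSoup(response.text, "html.parser")

    # 2. Extract only the data the user cares about
    rows = []
    for item in soup.find_all("div", class_="product"):            # placeholder selector
        name = item.find("h2").get_text(strip=True)
        price = item.find("span", class_="price").get_text(strip=True)
        rows.append([name, price])

    # 3. Output the collected data in a more useful format, e.g. a CSV spreadsheet
    with open("products.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Name", "Price"])
        writer.writerows(rows)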

What Kind of Web Scrapers are There?

Web scrapers can drastically differ from each other on a case-by-case basis.

For simplicity’s sake, we will break down some of these aspects into 4 categories. Of course, there are more intricacies at play when comparing web scrapers.

  • Self-built or pre-built
  • Browser extension vs. software
  • User interface
  • Cloud vs. local

Self-built or Pre-built

Just like how anyone can build a website, anyone can build their own web scraper.

However, the tools available to build your own web scraper still require some advanced programming knowledge. The scope of this knowledge also increases with the number of features you’d like your scraper to have.

On the other hand, there are numerous pre-built web scrapers that you can download and run right away. Some of these will also have advanced options added such as scrape scheduling, JSON and Google Sheets exports and more.


Browser extension vs Software

In general terms, web scrapers come in two forms: browser extensions or computer software.

Browser extensions are app-like programs that can be added onto your browser such as Google Chrome or Firefox. Some popular browser extensions include themes, ad blockers, messaging extensions and more.

Web scraping extensions have the benefit of being simpler to run and being integrated right into your browser.

However, these extensions are usually limited by living in your browser, meaning that any advanced features that would have to occur outside of the browser are impossible to implement. For example, IP rotation would not be possible in this kind of extension.

On the other hand, you will have actual web scraping software that can be downloaded and installed on your computer. While these are a bit less convenient than browser extensions, they make up for it in advanced features that are not limited by what your browser can and cannot do.

User Interface


The user interface between web scrapers can vary considerably.

For example, some web scraping tools will run with a minimal UI and a command line. Some users might find this unintuitive or confusing.

On the other hand, some web scrapers will have a full-fledged UI where the website is fully rendered for the user to just click on the data they want to scrape. These web scrapers are usually easier to work with for most people with limited technical knowledge.

Some scrapers will go as far as integrating help tips and suggestions through their UI to make sure the user understands each feature that the software offers.

Cloud vs Local

Where does your web scraper actually do its job?


Local web scrapers will run on your computer using its resources and internet connection. This means that if your web scraper has a high usage of CPU or RAM, your computer might become quite slow while your scrape runs. With long scraping tasks, this could put your computer out of commission for hours.

Additionally, if your scraper is set to run on a large number of URLs (such as product pages), it can have an impact on your ISP’s data caps.

Cloud-based web scrapers run on an off-site server which is usually provided by the company who developed the scraper itself. This means that your computer’s resources are freed up while your scraper runs and gathers data. You can then work on other tasks and be notified later once your scrape is ready to be exported.

This also allows for very easy integration of advanced features such as IP rotation, which can prevent your scraper from getting blocked by major websites due to its scraping activity.

What are Web Scrapers Used For?

By this point, you can probably think of several different ways in which web scrapers can be used. We’ve put some of the most common ones below (plus a few unique ones).

  • Scraping site data before a website migration
  • Scraping financial data for market research and insights

The list of things you can do with web scraping is almost endless. After all, it is all about what you can do with the data you’ve collected and how valuable you can make it.

Read our Beginner's guide to web scraping to start learning how to scrape any website!

The Best Web Scraper

So, now that you know the basics of web scraping, you’re probably wondering: what is the best web scraper for you?

The obvious answer is that it depends.

The more you know about your scraping needs, the better an idea you will have of which web scraper is best for you. However, that did not stop us from writing our guide on what makes the Best Web Scraper.

Of course, we would always recommend ParseHub. Not only can it be downloaded for FREE, but it comes with an incredibly powerful suite of features which we reviewed in this article, including a friendly UI, cloud-based scraping, awesome customer support and more.

Want to become an expert on Web Scraping for Free? Take our free web scraping courses and become Certified in Web Scraping today!

Contents

  • What is Beautiful Soup?
  • Application: Extracting names and URLs from an HTML page
  • But wait! What if I want ALL of the data?

Version: Python 3.6 and BeautifulSoup 4.

This tutorial assumes basic knowledge of HTML, CSS, and the Document Object Model. It also assumes some knowledge of Python. For a more basic introduction to Python, see Working with Text Files.

Most of the work is done in the terminal. For an introduction to using the terminal, see the Scholar’s Lab Command Line Bootcamp tutorial.

What is Beautiful Soup?

Overview

“You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help.” (Opening lines of Beautiful Soup)

Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages. Say you’ve found some webpages that display data relevant to your research, such as date or address information, but that do not provide any way of downloading the data directly. Beautiful Soup helps you pull particular content from a webpage, remove the HTML markup, and save the information. It is a tool for web scraping that helps you clean up and parse the documents you have pulled down from the web.

The Beautiful Soup documentation will give you a sense of the variety of things that the Beautiful Soup library will help with, from isolating titles and links, to extracting all of the text from the HTML tags, to altering the HTML within the document you’re working with.

Installing Beautiful Soup

Installing Beautiful Soup is easiest if you have pip or another Python installer already in place. If you don’t have pip, run through a quick tutorial on installing Python modules to get it running. Once you have pip installed, run the following command in the terminal to install Beautiful Soup:
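
    # Beautiful Soup 4 is published on PyPI as "beautifulsoup4"
    pip install beautifulsoup4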

You may need to preface this line with “sudo”, which gives your computer permission to write to your root directories and requires you to re-enter your password. This is the same logic behind you being prompted to enter your password when you install a new program.

With sudo, the command is:
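
    sudo pip install beautifulsoup4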

Additionally, you will need to install a “parser” for interpreting the HTML. To do so, run in the terminal:
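
    # lxml is one commonly used HTML parser for Beautiful Soup
    pip install lxml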

or
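
    # html5lib is an alternative, pure-Python parser
    pip install html5lib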

Finally, so that this code works with either Python 2 or Python 3, you will need one helper library. Run in the terminal:
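
    # "future" is a common Python 2/3 compatibility package and is assumed here;
    # the tutorial may intend a different helper library
    pip install future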

or
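
    # or, with elevated permissions as before (assumed)
    sudo pip install future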

Application: Extracting names and URLs from an HTML page

Preview: Where we are going

Because I like to see where the finish line is before starting, I will begin with a view of what we are trying to create: we are going from an HTML search results page to a CSV file of names and URLs,

using a Python script like this:
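
Sketched from the steps described below, the finished script looks roughly like this (the HTML filename, CSV filename and column headers are illustrative rather than the author’s exact choices):

    # soupexample.py -- a sketch of the finished Part I script
    import csv
    from bs4 import BeautifulSoup

    # parse the downloaded copy of the search results page
    with open("43rd-congress.html", "r") as html_file:
        soup = BeautifulSoup(html_file, "lxml")

    # drop the extra "search again" link, which sits inside a <p> tag rather than the table
    for p_tag in soup.find_all("p"):
        stray_link = p_tag.find("a")
        if stray_link is not None:
            stray_link.decompose()

    # every remaining <a> tag is a member: write its text and href to a CSV file
    links = soup.find_all("a")
    with open("43rd-congress-members.csv", "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["Name", "Link"])
        for link in links:
            writer.writerow([link.contents[0], link.get("href")])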

This tutorial explains how to assemble the final code.

Get a webpage to scrape

The first step is getting a copy of the HTML page(s) you want to scrape. You can combine Beautiful Soup with urllib3 to work directly with pages on the web. This tutorial, however, focuses on using Beautiful Soup with local (downloaded) copies of HTML files.

The Congressional database that we’re using is not an easy one to scrape because the URL for the search results remains the same regardless of what you’re searching for. While this can be bypassed programmatically, it is easier for our purposes to go to http://bioguide.congress.gov/biosearch/biosearch.asp, search for Congress number 43, and to save a copy of the results page.

Selecting “File” and “Save Page As …” from your browser window will accomplish this (life will be easier if you avoid using spaces in your filename). I have used “43rd-congress.html”. Move the file into the folder you want to work in.


(To learn how to automate the downloading of HTML pages using Python, see Automated Downloading with Wget and Downloading Multiple Records Using Query Strings.)

Identify content

One of the first things Beautiful Soup can help us with is locating content that is buried within the HTML structure. Beautiful Soup allows you to select content based upon tags (example: soup.body.p.b finds the first bold item inside a paragraph tag inside the body tag in the document). To get a good view of how the tags are nested in the document, we can use the method “prettify” on our soup object.

Create a new text file called “soupexample.py” in the same location as your downloaded HTML file. This file will contain the Python script that we will be developing over the course of the tutorial.

To begin, import the Beautiful Soup library, open the HTML file and pass it to Beautiful Soup, and then print the “pretty” version in the terminal.
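
A minimal version of this first step, assuming the lxml parser installed earlier, looks roughly like this:

    # soupexample.py -- first version: parse the HTML file and pretty-print it
    from bs4 import BeautifulSoup

    with open("43rd-congress.html", "r") as html_file:
        soup = BeautifulSoup(html_file, "lxml")

    print(soup.prettify())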

Save “soupexample.py” in the folder with your HTML file and go to the command line. Navigate (use ‘cd’) to the folder you’re working in and execute the following:
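
    python soupexample.py

(Use python3 instead of python if that is how Python 3 is invoked on your system.)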

You should see your terminal window fill up with a nicely indented version of the original HTML text (see Figure 3). This is a visual representation of how the various tags relate to one another.

Using BeautifulSoup to select particular content

Remember that we are interested in only the names and URLs of the various members of the 43rd Congress. Looking at the “pretty” version of the file, the first thing to notice is that the data we want is not too deeply embedded in the HTML structure.

Both the names and the URLs are, most fortunately, embedded in “<a>” tags. So, we need to isolate out all of the “<a>” tags. We can do this by updating the code in “soupexample.py” to the following:
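
Updated, the script looks roughly like this:

    # soupexample.py -- second version: collect every <a> tag in the document
    from bs4 import BeautifulSoup

    with open("43rd-congress.html", "r") as html_file:
        soup = BeautifulSoup(html_file, "lxml")

    # print(soup.prettify())

    links = soup.find_all("a")
    for link in links:
        print(link)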

Note that we added a “#” to the beginning of the print(soup.prettify()) line. The hash or pound sign “comments out” the code, or turns a line of code into a comment. This tells the computer to skip over the line when executing the program. Commenting out code that is no longer in use is one way to keep track of what we have done in the past.

Save and run the script again to see all of the anchor tags in the document.

One thing to notice is that there is an additional link in our file – the link for an additional search.

We can get rid of this with just a few lines of code. Going back to the pretty version, notice that this last “<a>” tag is not within the table but is within a “<p>” tag.

Because Beautiful Soup allows us to modify the HTML, we can remove the “<a>” that is under the “<p>” before searching for all the “<a>” tags.

To do this, we can use the “decompose” method, which removes the specified content from the “soup”. Do be careful when using “decompose”—you are deleting both the HTML tag and all of the data inside of that tag. If you have not correctly isolated the data, you may be deleting information that you wanted to extract. Update the file as below and run again.
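
A sketch of the updated script, removing any “<a>” that sits inside a “<p>” before collecting the links:

    # soupexample.py -- third version: remove the stray link, then collect the <a> tags
    from bs4 import BeautifulSoup

    with open("43rd-congress.html", "r") as html_file:
        soup = BeautifulSoup(html_file, "lxml")

    # decompose any <a> that lives inside a <p>, so only the table links remain
    for p_tag in soup.find_all("p"):
        stray_link = p_tag.find("a")
        if stray_link is not None:
            stray_link.decompose()

    links = soup.find_all("a")
    for link in links:
        print(link)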


Success! We have isolated out all of the links we want and none of the links we don’t!

Stripping Tags and Writing Content to a CSV file

But, we are not done yet! There are still HTML tags surrounding the URL data that we want. And we need to save the data into a file in order to use it for other projects.

In order to clean up the HTML tags and split the URLs from the names, we need to isolate the information from the anchor tags. To do this, we will use two powerful, and commonly used, Beautiful Soup methods: contents and get.

Where before we told the computer to print each link, we now want the computer to separate the link into its parts and print those separately. For the names, we can use link.contents. The “contents” method isolates out the text from within HTML tags. For example, if you started with an element such as
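
    <h2>This is my Header text</h2>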

you would be left with “This is my Header text” after applying the contents method. In this case, we want the contents inside the first tag in “link”. (There is only one tag in “link”, but since the computer doesn’t realize that, we must tell it to use the first tag.)

For the URL, however, “contents” does not work because the URL is part of the HTML tag. Instead, we will use “get”, which allows us to pull the text associated with (is on the other side of the “=” of) the “href” element.
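
In code, that step might look like this, with illustrative variable names:

    for link in links:
        name = link.contents[0]     # the text inside the <a> tag
        url = link.get("href")      # the value of the href attribute
        print(name)
        print(url)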

Finally, we want to use the CSV library to write the file. First, we need to import the CSV library into the script with “import csv”. Next, we create the new CSV file when we “open” it using “csv.writer”. The “w” tells the computer to “write” to the file. And to keep everything organized, let’s write some column headers. Finally, as each line is processed, the name and URL information is written to our CSV file.
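
Put together, the CSV-writing step could be sketched as follows (the output filename and column headers are illustrative):

    import csv

    with open("43rd-congress-members.csv", "w", newline="") as csv_file:
        writer = csv.writer(csv_file)        # "w" opens the file for writing
        writer.writerow(["Name", "Link"])    # column headers
        for link in links:
            name = link.contents[0]
            url = link.get("href")
            writer.writerow([name, url])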

When executed, this gives us a clean CSV file that we can then use forother purposes.

We have solved our puzzle and have extracted names and URLs from the HTML file.

But wait! What if I want ALL of the data?

Let’s extend our project to capture all of the data from the webpage. We know all of our data can be found inside a table, so let’s use “<tr>” to isolate the content that we want.
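
In code, that change might be as simple as:

    rows = soup.find_all("tr")
    for row in rows:
        print(row)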

Looking at the print out in the terminal, you can see we have selected a lot more content than when we searched for “<a>” tags. Now we need to sort through all of these lines to separate out the different types of data.

Extracting the Data

We can extract the data in two moves. First, we will isolate the link information; then, we will parse the rest of the table row data.


For the first, let’s create a loop to search for all of the anchor tags and “get” the data associated with “href”.
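
That loop, nested inside the loop over table rows, might look like this:

    for row in rows:
        for a_tag in row.find_all("a"):
            link = a_tag.get("href")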

We then need to run a search for the table data within the table rows. (The “print” here allows us to verify that the code is working but is not necessary.)
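
Extending the same loop:

    for row in rows:
        for a_tag in row.find_all("a"):
            link = a_tag.get("href")
        data = row.find_all("td")
        print(data)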

Next, we need to extract the data we want. We know that everything we want for our CSV file lives within table data (“td”) tags. We also know that these items appear in the same order within the row. Because we are dealing with lists, we can identify information by its position within the list. This means that the first data item in the row is identified by [0], the second by [1], etc.

Because not all of the rows contain the same number of data items, we need to build in a way to tell the script to move on if it encounters an error. This is the logic of the “try” and “except” block. If a particular line fails, the script will continue on to the next line.

Within this we are using the following structure:
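
    # take the 2nd <td> in the row and keep its text as a plain string
    years = str(data[1].get_text())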

We are applying the “get_text” method to the 2nd element in the row (because computers count beginning with 0) and creating a string from the result. This we assign to the variable “years”, which we will use to create the CSV file. We repeat this for every item in the table that we want to capture in our file.

Writing the CSV file

The last step in this file is to create the CSV file. Here we are using the same process as we did in Part I, just with more variables.
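
A sketch of the full Part II script, with illustrative variable and column names (the actual columns in the Congressional table may differ):

    # soupexample.py -- Part II: write every column of the table to a CSV file
    import csv
    from bs4 import BeautifulSoup

    with open("43rd-congress.html", "r") as html_file:
        soup = BeautifulSoup(html_file, "lxml")

    with open("43rd-congress-members.csv", "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["Name", "Link", "Years", "Position", "Party", "State", "Congress"])
        for row in soup.find_all("tr"):
            name = link = ""
            for a_tag in row.find_all("a"):
                name = a_tag.contents[0]
                link = a_tag.get("href")
            data = row.find_all("td")
            try:
                years = str(data[1].get_text())
                position = str(data[2].get_text())
                party = str(data[3].get_text())
                state = str(data[4].get_text())
                congress = str(data[5].get_text())
            except IndexError:
                # header rows and other short rows lack these cells; skip them
                continue
            writer.writerow([name, link, years, position, party, state, congress])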

As a result, our file will contain a full row of table data for each member of the 43rd Congress.

You’ve done it! You have created a CSV file from all of the data in the table, creating useful data from the confusion of the html page.




