
The best Python web scraping libraries

31-07-2021

Web scraping is the act of extracting data from websites across the internet. Closely related terms are web crawling and web extraction. It is a simple process that starts with a website URL as the initial target.

Python is a general-purpose language. It has many uses ranging from web development, AI, machine learning, and much more. You can perform web scraping with Python by taking advantage of some libraries and tools available on the internet.

In this tutorial we will go through some popular tools and libraries we can use with Python to scrape a web page. The tools we will cover are Beautiful Soup, Requests-HTML, Selenium, and Scrapy.

The following are the prerequisites you will need to follow along with this tutorial:

● The latest version of Python installed.

● pip, the Python package manager, installed.

● A code editor of your choice.

Once you’ve checked off the prerequisites above, open your terminal and run the commands below to create a project directory and navigate into it.

mkdir python_scraper
cd python_scraper

Let’s get started.

1. Python Web Scraping using Beautiful Soup

Beautiful Soup is a library that pulls data out of HTML and XML. It sits on top of a parser, providing elegant ways of navigating, searching, and modifying the parse tree.

Open your terminal and run the command below:

pip install beautifulsoup4

With Beautiful Soup installed, create a new Python file and name it beautiful_soup.py

We are going to scrape the Books to Scrape website (https://books.toscrape.com/) for demonstration purposes. The Books to Scrape website looks like this:

We want to extract the titles of each book and display them on the terminal. The first step in scraping a website is understanding its HTML layout. In this case, you can view the HTML layout of this page by right-clicking on the page, above the first book in the list. Then click Inspect.

Below is a screenshot showing the inspected HTML elements.

You can see that the list is inside the <ol class="row"> element. The next direct child is the <li> element.

What we want is the book title, which is inside the <a>, inside the <h3>, inside the <article>, and finally inside the <li> element.

To scrape the book titles, add the following code to the beautiful_soup.py file you created earlier:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url_to_scrape = "https://books.toscrape.com/"
request_page = urlopen(url_to_scrape)
page_html = request_page.read()
request_page.close()

html_soup = BeautifulSoup(page_html, 'html.parser')

# get book titles
for data in html_soup.select('ol'):
    for title in data.find_all('a'):
        print(title.get_text())

In the above code snippet, we open our webpage with the help of the urlopen() method. The read() method reads the whole page and assigns the contents to the page_html variable. We then parse the page using html.parser, which lets us work with the HTML as a nested tree.

Next, we use the select() method provided by the BS4 library to get the <ol class="row"> element. We loop through the HTML elements inside the <ol class="row"> element to get the <a> tags, which contain the book names. Finally, we print the text inside each <a> tag on every iteration with the help of the get_text() method.
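As a side note, Beautiful Soup's select() method accepts full CSS selectors, so a minimal sketch of a more targeted alternative (assuming the same page structure) could collect the title links in a single pass without nested loops:

# select every title link directly with one CSS selector
for a in html_soup.select('ol.row h3 a'):
    print(a.get_text())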

You can execute the file using the terminal by running the command below.

python beautiful_soup.py

This should display something like this:

Now let’s get the prices of the books too.

The price of the book is inside a <p> tag, which is inside a <div> tag. As you can see, there is more than one <p> tag and more than one <div> tag on the page. To get the right element with the book price, we will use a CSS class selector; luckily, each class is unique for each tag.

Below is the code snippet to get the prices of each book, add it at the bottom of the file:

# get book prices
for price in html_soup.find_all("p", class_="price_color"):
    print(price.get_text())

If you run the code on the terminal, you will see something like this:

Your completed code should look like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url_to_scrape = "https://books.toscrape.com/"
request_page = urlopen(url_to_scrape)
page_html = request_page.read()
request_page.close()

html_soup = BeautifulSoup(page_html, 'html.parser')

# get book titles
for data in html_soup.select('ol'):
    for a in data.find_all('a'):
        print(a.get_text())

# get book prices
for price in html_soup.find_all("p", class_="price_color"):
    print(price.get_text())

If you’ve made it this far, you’ve noticed how easy this is. Let’s move on to the next library.

2. Python scraping with Requests

Requests is an elegant HTTP library. It allows you to send HTTP requests without the need to build query strings into your URLs by hand. In this section we will use Requests-HTML, which builds on Requests and adds HTML parsing and CSS selector support.
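To illustrate the point about query strings, here is a minimal sketch using the plain requests package (httpbin.org is just a throwaway test endpoint used for this illustration):

import requests

# pass query parameters as a dict instead of building the query string by hand
response = requests.get('https://httpbin.org/get', params={'q': 'web scraping'})
print(response.url)          # https://httpbin.org/get?q=web+scraping
print(response.status_code)  # 200 if the request succeeded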

To use the requests library we first need to install it. Open your terminal and run the command below

pip3 install requests_html

Once it is installed, create a new Python file for the code. To avoid shadowing the library name requests, let's name the file requests_scrape.py

Now add the code below inside the created file:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://books.toscrape.com/')

get_books = r.html.find('.row')[2]

# get book titles
for title in get_books.find('h3'):
    print(title.text)

# get book prices
for price in get_books.find('.price_color'):
    print(price.text)

In this code snippet, we first import HTMLSession from the requests_html library and instantiate it. We then use the session to perform a GET request to the Books to Scrape URL.

After performing the GET request, we get the Unicode representation of the HTML content of the Books to Scrape website. From that content, we select the elements with the class row; the one at index 2 contains the list of books, and we assign it to the get_books variable.

We want the book title. As in the first example, the book title is inside the <a>, inside the <h3>. We loop through the HTML content to find each <h3> element and print its title as text.

To get the price of each book, we only change which element the find method searches for in the HTML content. Luckily, the price is inside a <p> with the unique class price_color, which appears nowhere else. We loop through the matches and print the text content of each <p> tag.

Execute the code by running the following command in your terminal:

python requests_scrape.py

Below is the output of the book titles:

Below is the output of the book prices:

You can visit Requests HTML scraping with Python to learn more about the many things you can do with it.

3. Python Web Scraping with Selenium

Selenium is a web-based automation tool. Its primary purpose is testing web applications, but it also does well at web scraping.

We are going to need a few tools to help us scrape.

First, we are going to install Selenium. There are several ways to install it:

● You can install using pip with the command:

pip install selenium

● You can also install using Conda with the command:

conda install -c conda-forge selenium

● Alternatively, you can download the PyPI source archive (selenium-x.x.x.tar.gz) and install it using setup.py with the command below:

python setup.py install

We will be using the Chrome browser, so we need ChromeDriver to work with Selenium.

Download ChromeDriver using either of the following methods:

1. You can either download directly from the link below

ChromeDriver download link. You will find several download options on the page depending on your version of Chrome. To find out which version of Chrome you have, click the three vertical dots at the top right corner of your browser window, click Help in the menu, then select About Google Chrome.

The screenshot below illustrates how to go about it:

After clicking, you will see your version. I have version 92.0.4515.107, shown in the screenshots below:

2. Or by running the commands below, if you are on a Linux machine:

wget https://chromedriver.storage.googleapis.com/2.41/chromedriver_linux64.zip
unzip chromedriver_linux64.zip

After downloading, you need to know where you saved the web driver on your local computer, because we will need its path. Mine is in my home directory.

To get the path to the web driver, open your terminal and drag the downloaded chromedriver file into the terminal window. The path to the web driver will be printed.

When you’re done, create a new Python file and call it selenium_scrape.py.

Add the following code to the file:

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://books.toscrape.com/'
driver = webdriver.Chrome('/home/marvin/chromedriver')
driver.get(url)

container = driver.find_element_by_xpath('//*[@id="default"]/div/div/div/div/section/div[2]/ol')

# get book titles
titles = container.find_elements(By.TAG_NAME, 'a')
for title in titles:
    print(title.text)

In the above code, we first import the webdriver module from Selenium, which will control Chrome. Selenium requires a driver to interface with the chosen browser.

We then specify the driver we want to use, which is Chrome. It takes the path to the ChromeDriver executable and navigates to the site URL. Because we have not launched the browser in headless mode, a browser window appears and we can see what it is doing.
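If you prefer to run the scraper without a visible browser window, a minimal sketch of launching Chrome in headless mode might look like the following (the chromedriver path is just the example path from above; adjust it to your own download location):

from selenium import webdriver

# configure Chrome to run without opening a visible window
options = webdriver.ChromeOptions()
options.add_argument('--headless')

# path to chromedriver is an assumption; point it at your own download
driver = webdriver.Chrome('/home/marvin/chromedriver', options=options)
driver.get('https://books.toscrape.com/')
print(driver.title)  # the page title prints even though no window is shown
driver.quit()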

The variable container holds the element located by the XPath of the <ol> list that contains the books. Selenium provides methods for locating elements by XPath, tag name, class name, and more. You can read more in the Selenium locating elements documentation.

To get the XPath of an element, inspect the page, find the element you want (here, the <ol> that wraps the book list), and right-click on it. A dropdown menu will appear; select Copy, then select Copy XPath.

Just as shown below:

From the container variable, we can then find the titles by the tag name a and loop through them to print each title as text.

The output will be as shown below:

Now, let’s extend the file to get the book prices by adding the following code after the book-titles code.

prices = container.find_elements(By.CLASS_NAME, 'price_color')
for price in prices:
    print(price.text)

In this code snippet, we get the price of each book using the class name of the price element and loop through the results to print each price as text. The output will look like the screenshot below:

Next, we want to access more data by clicking the next button and collecting the other books from other pages.

Change the file to resemble the one below:

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://books.toscrape.com/'
driver = webdriver.Chrome('/home/marvin/chromedriver')
driver.get(url)

def get_books_info():
    container = driver.find_element_by_xpath('//*[@id="default"]/div/div/div/div/section/div[2]/ol')

    # get book titles
    titles = container.find_elements(By.TAG_NAME, 'a')
    for title in titles:
        print(title.text)

    # get book prices
    prices = container.find_elements(By.CLASS_NAME, 'price_color')
    for price in prices:
        print(price.text)

    # follow the link to the next page
    next_page = driver.find_element_by_link_text('next')
    next_page.click()

for x in range(5):
    get_books_info()

driver.quit()

We have created the get_books_info function. It will run several times to scrape data from several pages, in this case 5 times.

We then use the find_element_by_link_text() method to get the <a> element whose link text is "next", which points to the next page.

Next, we call click() on it to take us to the next page. We scrape the data and print it to the console, and we repeat this 5 times because of the range function. After 5 successful scrapes, the driver.quit() method closes the browser.

You can choose a way of storing the data, either as a JSON file or in a CSV file. This is a task for you to do in your spare time; a possible starting point is sketched below.
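As a starting point, here is a minimal sketch of writing scraped rows to a CSV file with Python's built-in csv module. The books variable and the save_to_csv helper are hypothetical names for illustration; you would fill the list inside get_books_info (or a variant of it) instead of hard-coding it:

import csv

def save_to_csv(books, filename='books.csv'):
    # books is expected to be a list of dicts like {'title': ..., 'price': ...}
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'price'])
        writer.writeheader()
        writer.writerows(books)

# example usage with a couple of placeholder rows
books = [
    {'title': 'Some Book', 'price': '£10.00'},
    {'title': 'Another Book', 'price': '£12.50'},
]
save_to_csv(books)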

You can dive deeper into selenium and get creative with it. Let’s move on to the next library.

4. Python scraping with Scrapy

Scrapy is a powerful multipurpose tool used both for scraping and for crawling the web. Web crawling involves collecting the URLs of websites plus all the links associated with them, and finally storing them in a structured format on servers.

Scrapy provides many features, including but not limited to:

● Selecting and extracting data using CSS selectors

● Support for handling HTTP requests, crawl depth restriction, and user-agent spoofing,

● Storage of structured data in various formats such as JSON, Marshal, CSV, Pickle, and XML.

Let’s dive into Scrapy. We need to make sure we have Scrapy installed; install it by running the commands below:

sudo apt-get update
sudo apt install python3-scrapy

We will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run the command below:

scrapy startproject tutorial
cd tutorial

This will create a tutorial directory with the following contents:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

The screenshot below shows the project structure:

Before we add code to our newly created project, note that the best way to learn how to extract data with Scrapy is by using the Scrapy shell.

Scraping using the Scrapy Shell

The shell comes in handy because it speeds up debugging of our scraping code without the need to run the spider. To run the Scrapy shell, use the shell command like below:

scrapy shell <url>

On your terminal run :

scrapy shell 'https://books.toscrape.com/'

If you don’t get any data back, you can add the user agent with the command below:

scrapy shell -s USER_AGENT='<your user agent>' 'https://books.toscrape.com/'

To get your USER_AGENT, open your dev tools with Ctrl+Shift+I, navigate to the console, clear it, type navigator.userAgent, and hit Enter.

An example of a USER AGENT can be: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36.

The screenshot below shows how to get the name of your USER_AGENT in the dev tools:

If you’re successful in getting output from the shell command, you will see a resemblance to the one below:

Using the shell, you can try selecting elements using CSS. The shell returns a response object.

Let us get the response object containing the titles and prices of the books from our test website, Books to Scrape. The book title is inside an <a> element, inside the <h3>, inside the <article>, inside the <li>, inside the <ol> with the class row, and finally inside a <div> element.

We create a variable container and assign it to the selection of the <ol> element with the class row inside a <div> element.
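The original screenshot of this shell session is not reproduced here, so below is a minimal sketch of the command, assuming the same div ol.row selector that the spider uses later in this tutorial:

container = response.css('div ol.row')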

To see the container output in our Scrapy shell, type in container and hit enter, the output will be like below:

Now, let us find the title of each book, using the response object we got above.

Create a variable called titles to hold our book titles, using the container selection we got above.

We will select <a> inside the <h3> element, using the CSS selectors that scrapy provides.

We then use the CSS extension provided by scrapy to get the text of the <a> element.

Finally, we use the getall() method to get all the titles, as shown below:
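Again, since the screenshot is not shown here, a sketch of the shell command under the same assumptions would be:

titles = container.css('h3 a::text').getall()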

Run titles to get the output of all the book titles. You should see an output like below:

That went well.

Now, let us get prices for each book.

In the same Scrapy shell, create a prices variable to hold our prices. We use the same container selection and, with CSS, select the <p> element with the class price_color.

We use the CSS extension provided by scrapy to get the text from the <p> element.

Finally, the getall() method gets all the prices.

As shown below:
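A sketch of the corresponding shell command, under the same assumptions as above:

prices = container.css('p.price_color::text').getall()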

Run prices; your output should look like the one below:

That was quick, right? The Scrapy shell saves us a lot of debugging time because it is interactive.

Now let’s try using a spider.

Scraping using a spider

Let’s go back to the tutorial folder we created; we are going to add a spider.

A spider is what scrapy uses to scrape information from a website or a group of websites.

Create a new file named books_spider.py under the tutorial/spiders directory in your project.

Add the following code into the file:

import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        for book in response.css('div ol.row'):
            title = book.css('h3 a::text').getall()
            price = book.css('p.price_color::text').getall()
            yield {
                'title': title,
                'price': price,
            }

BooksSpider subclasses scrapy.Spider. It has a name attribute, which is the name of our spider, and a start_urls attribute, which holds a list of URLs.

The URLs in this list are used to make the spider's initial requests.

The spider can also define how to follow links in the pages and how to parse the downloaded page content to extract data.

The parse method parses the response, extracting the scraped data as dictionaries. It can also find new URLs to follow and create new requests from them.

To get output from our code, let’s run a spider. To run a spider, you can run the command with the syntax below:

scrapy crawl <spider name>

On your terminal run the command below:

scrapy crawl books

You will get an output resembling the one below:

We can store the extracted data in a JSON file using the feed exports that Scrapy provides out of the box. Feed exports support many serialization formats, including JSON, XML, and CSV, just to name a few.

To generate a JSON file with the scraped data, run the command below:

scrapy crawl books -o books.json

This will generate a books.json file with its contents resembling the one below:

Following links with Scrapy

Let’s try to follow the link to the next page and extract more book titles. We inspect the elements and get the link to the page we want to follow.

The link is in an <a> tag, inside an <li> tag with the class next, inside a <ul> tag with the class pager, and finally inside a <div> tag.

Below is a screenshot of the inspected element with a link to our next page:

Let’s use the Scrapy shell to get the link to the next page first. Run the scrapy shell command with the Books to Scrape URL.

We get the href attribute to determine the specific URL the next page points to, just like below:
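Since the screenshot is not reproduced here, a sketch of the shell command, using the same selector as the spider code below:

response.css('li.next a::attr(href)').get()
# on the first page this typically returns something like 'catalogue/page-2.html'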

Let’s now use our spider. Modify the books_spider.py file to repeatedly follow the link to the next page, extracting data from each page.

import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        for book in response.css('div ol.row'):
            title = book.css('h3 a::text').getall()
            price = book.css('p.price_color::text').getall()
            yield {
                'title': title,
                'price': price,
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

In this code snippet, we create a variable next_page that holds the URL of the next page. We then check that the link is not empty. Next, we use the response.follow method and pass it the URL and a callback; this returns a Request instance. Finally, we yield this Request.

We can go back to the terminal and extract a list of all books and titles into an allbooks.json file.

Run the command below:

scrapy crawl books -o allbooks.json

After it’s done scraping, open the newly created allbooks.json file. The output looks like the one below:

You can do a ton of things with Scrapy, including pausing and resuming crawls (see the example below) and a wide range of other web scraping tasks.
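For example, to make a crawl pausable and resumable, Scrapy lets you persist the crawl state by passing a job directory via the JOBDIR setting (the directory name below is just an illustration). Run the same command again later to resume where the crawl stopped:

scrapy crawl books -s JOBDIR=crawls/books-1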

Conclusion

In this tutorial, we discussed several open-source Python libraries for scraping website data. If you followed along to the end, you are now able to create anything from simple to more complex scrapers that crawl over any number of web pages. You can dive deeper into these libraries to hone your skills. Data is a very important part of decision-making in the world we live in today, and mastering how to collect it will place you way ahead.

The code for this tutorial is available from this Github Repository.

