< Back to Blog Overview

Web Scraping with Python – A Step by Step Guide

15-09-2022

In this post, we will learn web scraping with python & will scrape web pages using different libraries that python has to offer. Not just that we will also learn to parse data through different libraries of python.

The reason behind using python for this tutorial is pretty obvious, python is very flexible and has pretty great community support.

Even if you are a beginner trying to learn web scraping using python then this article will be very helpful for you. It is a long tutorial so fasten your seat belts and let’s get started.

web scraping with python
Web Scraping With Python

Before we begin with web scraping in Python let us understand the importance of headers while scraping any web page. We will explore headers in-depth. You might be afraid of headers or you might get an uncomfortable feeling when you see headers like x hyphen something.

I might be wrong but when I started coding I was very intimidated by headers. But soon I realized that it is very simple to use headers while making requests.

HTTP Headers (Importance in web scraping in python)

In this section, I am going to cover the concept of headers with some examples and I will share some links so that you can learn more about headers in detail. So, let’s jump on it.

You might already know when you make API calls, you transfer a piece of information with that in an envelope. Let’s say one person is a client and another person is a server and an envelope is getting transferred in the form of API and that is the mode of communication.

The contents inside that envelope are actually the data that is getting transferred from one person to another but you might also know that when such communications happen in real life on the top of the envelope there is also the address to whom this data has to go to. But along with that address, there is another address that is used when the letter is not received by the receiver.

This is just an analogy but what I am trying to explain to you is that headers also play a similar kind of role.

Headers are a sort of indication for the metadata of what the response or requests consist of inside. Now, to understand this let me categorize headers for you. So, mainly they can be categorized into four different categories.

types of headers
Types of Headers
  • Request Headers
  • Response Headers
  • Payload Headers
  • Representation Headers

It does not mean that a request header cannot be a response header or vice-versa. Let’s understand what each of these headers actually means.

Request Headers

It is a key value pair just like other headers and they are sent by the client who is requesting the data. It is sent so that the server can understand how it has to send the response. It also helps the server to identify the request sender.

Examples of Request headers are

  • Host: www.medium.com
  • User-Agent : Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
  • Referer: https://medium.com
  • Connection: close
  • Accept-Language: en-us
  • Accept-Encoding; gzip

Do remember Content-Type header is not a request header, it is a representation header. We will learn more about this but I just wanted to remove this confusion from your mind at the earliest.

From the list of sample headers shown above, the Host and User-Agent are holding the information that who is sending the request.

Accept-Language tells the server that this is the language in which I can understand your response and similarly Accept-Encoding tells the server that even if you have compressed data I can understand it.

Response Headers

They are just like request headers but the transmission is in reverse. Actually, these headers are sent by the server to the client. It explains to the client what to do with the response. It provides additional information about the data it has sent.

Example of Response headers:

  • Connection: keep-alive
  • Date: Mon, 08 Nov 2022
  • Server: nginx
  • Content-Type: text/html
  • Transfer-Encoding: chunked
  • Etag: W/”0815”

Etag is the response header that is used for versioning and cache. The Date is telling the client the date at which the response was sent from server to client. But again Content-Type or Content-Encoding are representation headers which we are going to cover in a bit.

Representation Headers

Representation headers represent the type of data that has been transferred. The data that has been sent from the server to the client can be in any format like JSON, HTML, XML, chunked (if the data size is huge), etc. The server also tells the client about the range of the content.

Examples of Representation headers:

  • Content-Type: text/html
  • Content-Encoding: gzip
  • Content-Length: 3523
  • Content-Range: bytes 50–1000/*
  • Content-Location: /docs/fo.xml

Content-Location tells the client about the alternate location of the resource or the data that is available for the client to retrieve the information. It can be a URL where that particular resource is stored.

Apart from these headers, there can be different headers like Trailer, Transfer-Encoding, Etag, if-Not-Match, Authorizations, etc.

Now, what if you are writing APIs and you want to define your own custom headers? Can you do that? You can absolutely do that. The way in which you define the request and response structure of your API similarly you can implement custom headers that you or the server is going to accept.

An example of a custom header could be the Authorization header. This header can have any value. Further, a server can use the value to identify the client or it can be used for any other logic operations.

Requests

In this section, we are going to learn about python library requests and with the help of this library, we are going to scrape a website. So, why do we need this library and how can we use it?

It is the most popular library downloaded by everyone. It allows us to make an http request to different websites. It opens a socket to the target website and asks them for their permission to connect. This is how multiple applications can talk with each other.

Now, let’s understand how we can use it with a simple web scraping example. We will scrape amazon for this example.

mkdir scraper
pip install requests

Then create a file scraper.py in this folder and then start coding with me.

target_url = "https://www.amazon.com/dp/B08WVVBWCN"

headers = {“User-Agent”:”Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"}

resp = requests.get(url, headers=headers).text

pint(resp.status_code)

Here we have declared a target_url variable that stores our target URL from amazon.com. Then we declared a header and then finally we made a GET request to our target URL. This is what happens when we run this code.

scraping amazon

When we print the status we get status as 200 which means we were able to scrape amazon successfully. You can even print the HTML code we received from amazon by just replacing status_code with text.

print(resp.text)

It will look something like this.

image 4 for python blog

As you can see this data is not readable at all. We need to parse the data out from this junk. For this, we will use BeautifulSoup.

BeautifulSoup

It is also known as BS4. So, it is basically used for pulling data out of any HTML or XML files. It is used for searching and modifying any HTML or XML data.

Now lets us understand how we can use it. We will use HTML data from our last section. But before anything, we have to import it into our file.

from bs4 import BeautifulSoup

From our target page, we will extract a few important data like name, price, and product rating. For extracting data we will need a parse tree.

soup=BeautifulSoup(resp.text, ’html.parser’)
image 5 for python blog

When you inspect the name you will see that it is stored inside a class a-size-large product-title-word-break.

name = soup.find(“span”,{“class”:”a-size-large product-title-word-break”}).text
print(name)

When we print the name we get this.

image 6 for python blog

As you can see we got the name of the product. Now, we will extract the price.

image 7 for python blog

By inspecting the price I can see that the price is stored inside a-offscreen class and this class is stored inside priceToPay class.

price = soup.find(“span”,{“class”:”priceToPay”}).find(“span”,{“class”:”a-offscreen”}).text
print(price)

When we print it we get this.

image 8 for python blog

Now, the last part is to extract the rating of the product.

image 9 for python blogs

As you can see the rating is stored inside a class a-icon-star.

rating = soup.find(“i”,{“class”:”a-icon-star”}).text

So, when we print this we get this.

>>> 4.9 out of 5 stars

But if you just need the 4.9 part and you want to remove all of the extra text then we will use split function of python.

rating = soup.find(“i”,{“class”:”a-icon-star”}).text.split(“ “)[0]

This will provide us with just the rating part.

>>> 4.9

We were able to parse out all the necessary data from the junk HTML we got in the first section by making a GET request through the requests library.

Now, what if you have to store this data in a CSV file? We will require the Pandas library for this task.

Pandas

Pandas is a python library that provides flexible data structures and makes our interaction with data very easy. We will use it to save our data in a CSV file.

obj={}

arr=[]

obj[“name”] = soup.find(“span”,{“class”:”a-size-large product-title-word-break”}).text.lstrip()

obj[“price”] = soup.find(“span”,{“class”:”priceToPay”}).find(“span”,{“class”:”a-offscreen”}).text

obj[“rating”] = soup.find(“i”,{“class”:”a-icon-star”}).text.split(“ “)[0]
arr.append(obj)

First, we declared an object and an array. Then we stored all our target data inside this object. Then we pushed this object inside an array. Now, we will create a data frame using pandas with this array, and then using that data frame we will create our CSV file.

This will create a CSV file by the name amazon_data.csv inside your folder.

image 10 for python blog

Pandas made our job a lot easier. Using this technique you can scrape amazon pages at any scale.

Complete code

import requests
from bs4 import BeautifulSoup
import pandas as pd

obj={}
arr=[]

url = “https://www.amazon.com/dp/B08WVVBWCN"

headers = {“User-Agent”:”Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"}

resp = requests.get(url, headers=headers)
print(resp.status_code)
soup=BeautifulSoup(resp.text,’html.parser’)

obj[“name”] = soup.find(“span”,{“class”:”a-size-large product-title-word-break”}).text.lstrip()

obj[“price”] = soup.find(“span”,{“class”:”priceToPay”}).find(“span”,{“class”:”a-offscreen”}).text

obj[“rating”] = soup.find(“i”,{“class”:”a-icon-star”}).text.split(“ “)[0]

arr.append(obj)
df = pd.DataFrame(arr)
df.to_csv(‘amazon_data.csv’, index=False, encoding=’utf-8')

print(arr)

Summary

As you saw Requests, BS4 and pandas made our job of extracting data from amazon a lot easier. Obviously, if you want to scrape millions of amazon pages with a requests library then you will have to manage many things like proper headers, proxy rotation, and captcha handling.

But but but, if you use Scrapingdog’s Web Scraping API then you won’t have to handle those extra steps at your end. Scrapingdog will use its large pool of proxy and headers to scrape amazon successfully. 

This data extracting tool is not restricted to just amazon, you can scrape any website even if it requires JS rendering. 

Scrapingdog is the fastest and the most reliable web scraping API and of course, we provide 1000 free API credits to our new users.

If you want to learn to scrape other websites like google, yelp, etc using requests and BS4 then read the following articles:

Scrapy

It is a powerful python framework that is used to extract data from any website in a very flexible manner. It uses Xpath to search and extract data. It is lightweight and very easy for beginners to understand.

Now, to understand how Scrapy works we are going to scrape amazon with this framework. We are going to scrape the book section of amazon, more specifically we are going to scrape books that were released in the last 30 days.

image 11 for python blog

We will start with creating a folder and installing Scrapy.

>>> mkdir scraper

>>> pip install scrapy

Now, before we start coding we have to create a project. Just type the below command in your terminal.

>>> scrapy startproject amazonscraper

This command will create a project folder inside scraper folder by the name amazonscraper.

image 12 for python blog

The above command also returns some messages on the terminal where it is telling you how you can start writing your own scraper. We will use both of these commands.

Let’s go inside this amazonscraper folder first.

>>> cd amazonscraper
>>> scrapy genspider amazon_spider amazon.com

This will create a general spider for us so that we don’t have to create our own spider by going inside the spider folder, this will automatically create it for us. Then we name the spider and then we type the domain of our target website.

When you press enter you will have a file by the name amazon_spider.py inside your folder. When you open that file you will find that a parse function and an Amazonspider class have been automatically created.

import scrapy

class AmazonSpiderSpider(scrapy.Spider):
 name = ‘amazon_spider’
 allowed_domains = [‘amazon.com’]
 start_urls = [‘http://amazon.com/']

def parse(self, response):
 pass

We will remove allowed_domains variable as we do not need that and along with that, we will declare start_urls to our target URL.

//amazon_spider.py

import scrapy

class AmazonSpiderSpider(scrapy.Spider):
 name = ‘amazon_spider’
 allowed_domains = [‘amazon.com’]
 start_urls = [‘https://www.amazon.com/s?k=books&i=stripbooks-intl-ship&__mk_es_US=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=11NL2VKJ00J&sprefix=bo%2Cstripbooks-intl-ship%2C443&ref=nb_sb_noss_2']

def parse(self, response):
 pass

Before we begin with our scraper we need to create some items in our items.py file which are temporary containers. We will scrape the title, price, author, and image link from the amazon page.

Since we need four items from amazon we will add four variables for storing the values.

//items.py

import scrapy

class AmazonscraperItem(scrapy.Item):
 # define the fields for your item here like:
 # name = scrapy.Field()
 product_name = scrapy.Field()
 product_author = scrapy.Field()
 product_price = scrapy.Field()
 product_imagelink = scrapy.Field()
 pass

Now, we will import this file into our amazon_spider.py file.

//amazon_spider.py

from ..items import AmazonscraperItem

Just type it at the top of the file. Now, inside our parse method, we are going to declare a variable which will be the instance of AmazonscraperItem class.

def parse(self, response):
 items = AmazonscraperItem()
 pass

We are now ready to scrape our target elements from amazon. We will start with scraping the product name.

We will declare a variable product_name which will be equal to a CSS selector for the product name element.

def parse(self, response):
 items = AmazonscraperItem()
 product_name= response.css()
 pass

Here I am going to use the SelectorGadget extension to get the element location on the target page.

image 13 for python blog

At the bottom right you can see our CSS selector. I am just going to copy it from here and I will paste it into our code.

def parse(self, response):
 items = AmazonscraperItem()
 product_name= response.css(‘.a-size-medium’).extract()
 pass

I have used the .extract() function to get the HTML part of all those product elements. Similarly, we are going to use the same technique to extract product price, author, and image link.

While finding CSS selectors for the author SelectorGadget will select some of them and will leave many authors unselected. So, you have to select those authors as well.

image 14 for python blog
def parse(self, response):
 items = AmazonscraperItem()
 product_name= response.css(‘.a-size-medium’).extract()

product_author = response.css(‘.a-color-secondary .a-row .a-size-base+ .a-size-base , .a-color-secondary .a-size-base.s-link-style , .a-color-secondary .a-size-base.s-link-style font’).extract()

 pass

Now, let’s find the CSS selector for the price as well.

image 15 for python blog
def parse(self, response):
 items = AmazonscraperItem()
 product_name= response.css(‘.a-size-medium’).extract()

product_author = response.css(‘.a-color-secondary .a-row .a-size-base+ .a-size-base , .a-color-secondary .a-size-base.s-link-style , .a-color-secondary .a-size-base.s-link-style font’).extract()

product_price = response.css(‘.s-price-instructions-style .a-price-fraction , .s-price-instructions-style .a-price-whole’).extract()
 
 pass

Finally, now we will find the CSS selector for the image.

image 16 for python blog

.s-image is the CSS selector for our images.

def parse(self, response):
 items = AmazonscraperItem()
 product_name= response.css(‘.a-size-medium’).extract()

product_author = response.css(‘.a-color-secondary .a-row .a-size-base+ .a-size-base , .a-color-secondary .a-size-base.s-link-style , .a-color-secondary .a-size-base.s-link-style font’).extract()

product_price = response.css(‘.s-price-instructions-style .a-price-fraction , .s-price-instructions-style .a-price-whole’).extract()

 product_imagelink = response.css(‘.s-image’).extract()

Now, as I said earlier this will only provide us with the HTML code and we need to extract the name out of it. So, for that, we will use the text feature of Scrapy.

This will make sure that the whole tag does not get extracted and that only the text from this tag gets extracted.

product_name= response.css(‘.a-size-medium::text’).extract()

But because we are using multiple classes for the CSS selector that is why we can’t add this text at the end.

We have to use .css() function for product_price and product_author.

product_author = response.css(‘.a-color-secondary .a-row .a-size-base+ .a-size-base , .a-color-secondary .a-size-base.s-link-style , .a-color-secondary .a-size-base.s-link-style font’).css(‘::text’).extract()

product_price = response.css(‘.s-price-instructions-style .a-price-fraction , .s-price-instructions-style .a-price-whole’).css(‘::text’).extract()

Now product_imagelink is just selecting the image so we will not use .css() function on it. Our image is stored inside the src tag and we need its value.

image 17 for python blog

We will use the attr feature of Scrapy.

product_imagelink = response.css(‘.s-image::attr(src)’).extract()

We have managed to extract all the values. Now, we will store them in their individual temporary item containers, and this is how we do it.

items[‘product_name’] = product_name

This product_name is actually the variable that we have declared in our items.py file. We are going to do this with all our other target elements.

items[‘product_name’] = product_name
 items[‘product_author’] = product_author
 items[‘product_price’] = product_price
 items[‘product_imagelink’] = product_imagelink

Now, we just need to yield the items and this will complete our code. Our code might not at first but let’s see what we have got.

yield items

Now, to run our code run the below command on your terminal.

>>> scrapy crawl amazon_spider
image 18 for python blog

As you can see we got an empty array. This is due to the anti-bot mechanism of amazon. To overcome this we are going to set a User-Agent in our settings.py file.

USER_AGENT = ‘Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0’

Now, let’s try again.

image 19 for python blog

Hurray!! we got the results. But as usual, this will not work for long as amazon’s anti-bot technique will kick in and your scraper will stop.

For scraping any number of pages you are advised to use a Python Web Scraping API.

Complete Code

//amazon_spider.py

import scrapy
from ..items import AmazonscraperItem

class AmazonSpiderSpider(scrapy.Spider):
 name = ‘amazon_spider’
 allowed_domains = [‘amazon.com’]
 start_urls = [‘https://www.amazon.com/s?k=books&i=stripbooks-intl-ship&rh=n:283155,p_n_publication_date:1250226011&dc&language=es&ds=v1:0r+6Zb7Q60+15gaAfSXGzhcbIdyc5r/TuKQVY1NC/ew&__mk_es_US=ÅMÅŽÕÑ&crid=11NL2VKJ00J&qid=1662730061&rnid=1250225011&sprefix=bo,stripbooks-intl-ship,443&ref=sr_nr_p_n_publication_date_1']

def parse(self, response):
 items = AmazonscraperItem()
 product_name= response.css(‘.a-size-medium::text’).extract()
 product_author = response.css(‘.a-color-secondary .a-row .a-size-base+ .a-size-base , .a-color-secondary .a-size-base.s-link-style , .a-color-secondary .a-size-base.s-link-style font’).css(‘::text’).extract()
 product_price = response.css(‘.s-price-instructions-style .a-price-fraction , .s-price-instructions-style .a-price-whole’).css(‘::text’).extract()
 product_imagelink = response.css(‘.s-image::attr(src)’).extract()

items[‘product_name’] = product_name
 items[‘product_author’] = product_author
 items[‘product_price’] = product_price
 items[‘product_imagelink’] = product_imagelink

yield items

and this is our items.py file

//items.py

import scrapy

class AmazonscraperItem(scrapy.Item):
 # define the fields for your item here like:
 # name = scrapy.Field()
 product_name = scrapy.Field()
 product_author = scrapy.Field()
 product_price = scrapy.Field()
 product_imagelink = scrapy.Field()
 pass

Functionalities of Scrapy do not stop here!!

  • You can set a parallel request number in your settings.py file by changing the value of CONCURRENT_REQUESTS. This will help you to check how much load an API can handle.
  • It is faster than most of the HTTP libraries provided by python.

Selenium

Selenium is a framework to test websites and other web applications. It supports multiple programming languages and on top of that, you get support from multiple browsers not just chrome. It provides APIs to make connections with your driver.

Let’s understand this framework with a simple web scraping task. We are going to scrape a dynamic website with selenium. Our target website will be Walmart. The first step is to install selenium. Type the below command on your terminal to install selenium.

>>> pip install selenium
image 20 for python blog

Our job would be to open this website and extract the HTML code and print it.

So, the first step is to import all the libraries in our file.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

Then we are going to set the options which selenium provides. We will set the page size and we will run it in a headless format.

The reason behind running it in a headless form is to avoid extra usage of GUI resources. Even while using selenium in production on external servers you are advised to use it with headless mode to avoid wasting CPU resources. This will ultimately increase your cost because you will need to add more servers for load balancing.

options = Options()
options.headless = True
options.add_argument(“ — window-size=1920,1200”)

Now, we will declare our driver and you have to use the path where your chromium driver is installed.

PATH_TO_DRIVER='YOUR_PATH_TO_CHROIUM_DRIVER'

driver = webdriver.Chrome(options=options, 
executable_path=PATH_TO_DRIVER)

url="https://www.walmart.com/search/?query=python%20books"

We have also declared our target URL. Now, we just need to use it’s .get() method to open the driver.

driver.get(url)

time.sleep(4)
print(driver.page_source)

I used the sleep method to load the website completely before printing the HTML. I just want to make sure that the website is loaded completely before printing.

While printing we have used page_source property of selenium. This will provide us with the source of the current page. This is what we get when we print the results.

image 21 for python blog

We get the required HTML part. Now, Walmart also has an anti-bot detection method just like amazon but Walmart needs JS rendering also for scraping.

To scrape websites like Walmart you can always use Scrapingdog’s Web Scraping Tool to avoid managing JS rendering and proxy rotation.

The reason behind using JS rendering for certain websites is:

  1. To load all the javascript hooks, once all of them are loaded we can easily scrape it at once by just extracting the source after loading it completely on our browser.
  2. Some websites need lots of AJAX calls to load completely. So, in place of a normal GET HTTP call, we render JS for scraping. To verify whether a website needs a JS rendering or not you can always look at the network tab of that website.

It also provides certain properties which might help you in the future.

  • driver.title — This can be used for extracting the title of the page.
  • driver.orientation — This will provide the physical orientation of the device with respect to gravity.

Advantages of using Selenium

  • The best advantage I found is you can use it with any programming language.
  • You can find bugs at an earlier stage of your testing or production.
  • It has great community support.
  • It supports multiple browsers like Chrome, Mozilla, etc.
  • Very handy when it comes to data scraping.

Disadvantages of Using Selenium

  • Image comparison is absent in selenium.
  • Time-consuming.
  • The setup of the test environment is not that easy for beginners.

Regular Expression

Regular expression is a powerful tool to find patterns in text. They are like using Ctrl-F on a word document but much more powerful than them.

This is very helpful when you verify any type of user input and most importantly while scraping the web. The application of Regular expression is very big.

This can be challenging at first but once you are ready, believe me, it will make your job much more efficient.

regular expressions

Its symbols and syntax are universal in all programming languages. To understand regular expression we are going to validate certain strings which you might face while web scraping in Python.

Let’s say you want to scrape emails from the web for the lead generation process of your company. The first part of the email can consist of:

  • Uppercase letters [A-Z]
  • Lower Case letters [a-z]
  • numbers [0–9]

Now, if the email which is scraped does not follow this pattern then we can easily neglect that email and can move on to another email. We will write a simple code in python to identify emails like these and we are going to use re library of python.

import re

pattern = "[a-zA-Z0-9][email protected]"

Brackets allow us to specify that we are looking for the characters in a given string such as email. We are going to match the pattern till @ symbol and the plus sign after the bracket means we are looking for any combination of one or more of these characters.

Since emails are provided by many domains then we have to specify that we are looking for one or more upper and lowercase letters.

pattern = "[a-zA-Z0-9][email protected][a-zA-Z]"

Now, let’s check whether this can work with an if and else statement.

email = input()

if(re.search(pattern,email)):
 print(“Valid email”)
else:
 print(“invalid email”)

Run this file on your terminal to check.

image 23 for python blog

Now, let’s try [email protected]

image 24 for python

This is how you can identify correct email strings. Now, we will learn how we can replace a character with another one using regular expression

Replacing String

This can come in handy when you are making changes to a large database where you might have thousands of strings to update.

Now, let’s say we need every phone number entered into a continuous string of numbers with no hyphens but we want to keep the hyphens which are in word form. We will write regular expressions for that.

import re

pattern = “(\d\d\d)-(\d\d\d)-(\d\d\d\d)”

“\d” will match any single digit. Each set of parenthesis resembles a group.

new_pattern = r”\1\2\3”

So, from left to right we have three different groups. But we need to write what we want this pattern to turn into. Let’s preserve the group but remove the hyphens.

Each backslash number represents a group so our new pattern is concatenating the three groups without the hyphen. We have put r before the string to consider it as the raw string.

Now, let’s take input from the user and check whether it works or not.

import re

pattern = “(\d\d\d)-(\d\d\d)-(\d\d\d\d)”
new_pattern = r”\1\2\3"

phoneNumber = input()

final_output = re.sub(pattern, new_pattern, phoneNumber)

print(final_output)

This was just a basic example of how regular expression can be used in data scraping with python. Regular expression works with any language and the rate of response is pretty fast.

You can find tons of material on regular expression online. I found this course very helpful during my python journey. Also, if you want to test your expression then this website can help you.

XPath

XPath stands for XML path language which is actually a query language for selecting nodes from an XML document. Now, if you do not know about XML documents then web scraping with XPath covers everything for you.

XML stands for Extensible Markup Language which is a bit like your hypertext markup language which is HTML but there is a very distinct difference between the two. HTML has a predefined set of tags that have a special meaning for example you have a body tag or you have a head tag or a paragraph tag. So, all these tags have a special meaning to your browser, right? But for XML there is no such thing. In fact, you can give any name to your tags and they do not have any special meaning there.

The design goal of XML documents is that they emphasize simplicity, generality, and usability across the internet. That’s why you can use any name for your tags and nowadays XML is generally used for the transfer of data from one web service to another. So, that is another main use of XML.

Coming back to Xpath, well it is a query language for XML documents and the special thing to note here is that it is used for selecting nodes. Now, you might be thinking that what are these nodes or this node terminal, right? Well, you can think of any XML document or even any HTML document like a tree.

image 26 for python blog

Now, why I am saying that is because if you try to see this particular XML document you have a tag called “Movie Database” in which you have multiple movie tags then in each movie you have a title tag, year tag, directed by tag, and so on.

So, in this way, we are creating a nested structure, and if you try to visualize a tree we can. We have a movie database tag in which we can have multiple movies in each movie we have a title, year, etc. Similarly, in the cast tag we have actors with different tags for first name and last name.

image 27 for python blog

So, this nesting of the tags allows you to visualize the XML or HTML documents like trees. That’s why we have the concept of nodes in the trees. So, all these tag elements are the nodes of your tree. Similarly, HTML can be visualized and then parsed like a tree.

For parsing, we can use libraries like Beautifulsoup. So, HTML documents or XML documents can be visualized like a tree, and XML parts in a text can be used for querying and selecting some particular nodes which follow the pattern specified by the Xpath syntax to select some particular nodes.

This is the concept behind Xpath and now let me show you some examples so that we can understand Xpath syntax a bit.

Example

We are not going to go into much detail about the Xpath syntax itself because in this video our main aim is to learn how to use Xpath for web scraping.

image 28 for python blog

So, let’s say I have an XML document in which this is the code. I have a bookstore tag at the root in which I have multiple book tags and inside that, I have title and price tags.

You can find this Xpath tester on this website. This is where I am testing this XML and Xpath expression.

image 29 for python blog

Now, if I type “/” in that then it means I want to search from the root of your tree and I will write bookstore. So, what it will do is it will search from the root for the bookstore. So, now if I click TEST XPATH I will get this.

image 30 for python blog

This is the complete bookstore. Now, let’s say in the bookstore I want to get all the books that we have. So, for that, you will do this.

image 31 for python blog

And then I will get this result. I got all the books inside the bookstore.

image 32 for python blog

Now, let’s say you want to get only that book whose ID is 2. So, you will just put a square bracket, and inside that, you will pass ‘@id=”2”’.

image 33 for python blog

When you use @ with some attribute then you are referring to a particular attribute inside your book tag in this case and you are saying hey! find all those book tags whose ID is 2. When we run it we get this.

image 34 for python blog

Look at this, we are getting only that book whose ID is 2. Now, let’s say I want to get the price of that book whose ID is 2. For that, I will simply do this.

image 35 for python blog

And in response, I get this.

image 36 for python blog

So, this is how Xpath works. Now, if you want to learn more about Xpath syntax then you can just visit w3schools for more details. Other than that this is all we need to know in order to create a web scraper using it.

LXML

It is a third-party library for working with XML. We have learned enough about XML in the previous section.

LXML provides full XPath support and nice factory functions that make it a better choice. The goal of LXML is to work with XML using the element tree API stored in lxml etree.

LXML can read from files or string objects of XML and parse them into etree elements.

Now, let’s understand how we can use lxml while web scraping. First, create a folder and install this library.

>>> mkdir scraper
>>> pip install lxml

Once that is done, create a scraper.py file inside your folder scraper and start coding with me.

from lxml import html

import requests

We have imported the requests library to request because we have to get the HTML data of that web page as well.

url=”https://en.wikipedia.org/wiki/Outline_of_the_Marvel_Cinematic_Universe”

and then we will send an HTTP request to our URL.

resp = requests.get(url)
print(resp)

Now, if you will run it you will get 200 code which means we have successfully scraped our target URL.

image 37 for python blog

Now, let’s create a parse tree for our HTML document.

tree = html.fromstring(resp.content)

html.fromstring is a function that takes your HTML content and creates a tree out of it and it will return you the root of that tree. Now, if you print the tree you will get this <Element html at 0x1e18439ff10>. 

So, it says we have got HTML elements at some position, and as you know HTML tag is the root of any HTML document.

Now, I want to search certain elements using Xpath. We have already discovered the Xpath earlier in this article. Xpath of our target element is //*[@id=”mw-content-text”]/div[1]/table[2]/tbody/tr[3]/th/i/a

elements = tree.xpath(‘//*[@id=”mw-content-text”]/div[1]/table[2]/tbody/tr[3]/th/i/a’)

We have passed our Xpath inside the tree function. Do remember to use single or triple quotes while pasting your Xpath because python will give you an error for double quotes because our Xpath already has them.

Let’s print and run it and see what happens.

>>> [Element a at 0x1eaed41c220]

On running the code we got our target element which was matched with this particular Xpath.

you will get this <Element a at 0x1eaed41c220>. As, you can see it is an anchor tag. We have two options to get the data out of this tag.

  1. .text will return the text the contains. Like elements[0].text will return Iron Man
  2. .attrib will return a dictionary {‘href’: ‘/wiki/Iron_Man_(2008_film)’, ‘title’: ‘Iron Man (2008 film)’}. This will provide you with the href tag which is actually the link and that is what we need. We also get the title of the movie.

But since we only need href tag value so we will do this

elements[0].attrib[‘href’]

This will return the target link.

This is what we wanted. If you want to extract all the elements then read web scraping with lxml.

Socket

The socket is used to connect two or more nodes so that they can communicate with each other. A medium server is used which transmits data from one node to another without letting them have a direct connection.

Socket
Socket

Now, how can you make an HTTP request using a socket? Well, it can be done by opening a socket. Let us understand with a simple python code.

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((“www.google.com", 80))
sock.send(b”GET / HTTP/1.1\r\nHost:www.google.com\r\n\r\n")
response = sock.recv(4096)
sock.close()
print(response.decode())

The above code has seven steps and we are going to understand them step-by-step.

  • The first step is to import our socket library.
  • We have created a socket constructor which accepts two parameters. One is the socket family and another one is the socket type. You can read more about it here.
  • Then we have used a web address where we are going to establish a connection. You can pick any address.
  • This is where we make a GET request. It is in a normal text form. Using .send() we have converted it to byte form
  • This is where we as a client catch the response from the server. 4096 is the number of bytes and we want to receive the maximum amount of data.
  • After receiving complete data from the server, we closed the connection. It is a necessary step.
  • We have printed the response.
image 40 for python blog

I hope you must have got a basic idea of how the socket works. Examples of sockets are messaging, web browsing, etc.

Urllib3

Urllib3 is an authentic python library for making HTTP requests to any web address. Now, why it is authentic is because unlike requests it is a built-in part of python. You can use this library if you want to reduce dependencies. This package contains five modules:

  • request — It is used to open URLs.
  • response — This is used by the request module internally. You will not use it directly.
  • error — Provides error classes to request module.
  • parse– It breaks URL into schema, host, port, path, etc.
  • robotparser– It is used to inspect the robots.txt file for permissions.

Now, we will understand how urllib3 can be used through simple code.

import urllib3
http = urllib3.PoolManager()
r = http.request(‘GET’, ‘https://www.scrapingdog.com/robots.txt')
print(r.status)

print(r.data)
image 41 for python blog

Steps will look similar to the requests library. PoolManager keeps a track of a number of connections.

Then we send a normal GET request to a robots.txt URL. We can even send POST and DELETE requests with urllib3.

// POST request

import urllib3
http = urllib3.PoolManager()

r = http.request(‘POST’, ‘http://httpbin.org/post', fields={“Title”: “Scrapingdog”, “Purpose”: “Web Scraping API”, “Feature”: “Fastest Web Scraper”})

print(r.status)

print(r.data)

fields argument will send the data from the client to the server. We are sending a JSON object. The server will send a response to make the confirmation of data added to its database.

image 42 for python blog

There are very high chances that as a beginner you might not use urllib3 for web scraping. Most probably you are going to use requests. But there are certain advantages of using urllib3 over requests.

image 43 for python blog

For parsing data, you can use BS4 or RegEx.

MechanicalSoup

It is like the child of BS4 because it takes the support of BS4 to mechanize everything. It allows us to do so much more with fewer lines of code. It automates website scraping and on top of that, it can follow redirects and can send and store cookies on its own.

Let’s understand MechanicalSoup a little with some python code. You need to install it before we begin coding.

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()

browser object will allow us to enter commands without the need of creating new variables. Now, we are going to open the target URL.

url=”https://www.scrapingdog.com"
browser.open(url)

.open() will return an object of type requests.Response, this is due to the fact that Mechanical Soup is using the requests module to make the call.

Image 44 for python blogs

browser.get_current_page() will provide you with the HTML code of the page. It also provides many arguments like .find_all() and .select_form() to search for any element or tag in our HTML data.

Altogether it is a great library to try web scraping a little differently. If you want to learn more about this I would advise you to read this article.

Conclusion

We discussed eight python libraries that can help you scrape the web. Each library has its own advantage and disadvantage. Some are easy to use but not effective and some of them could be a little difficult to understand but once you have understood them fully it will help you to get things done in no time like RegEx.

I have created a table to give you a brief idea of how all these libraries. I have rated them on the basis of difficulty, usage, and Application. I have given them numbers out of 5.

python libraries
Python Libraries

I hope you like this blog post, I have tried to mention all the popular python libraries in this post.

If you think I have left some topics then please do let us know.

Additional Resources

Frequently Asked Questions

No, It isn’t. Anyone with a basic programming language can scrape website with the right tool.

This tutorial here used all the libraries that are used in python and have extensively explained all of them.
Learning web scraping with python is easy. Also, python is used world wide with ease and hence it is easy to learn web scraping with python.

Beautiful Soup is used for smaller projects while Selenium is used for a little complex projects. Hence, both have different advantages depending where we are using both libraries.

No, you can scrape any data that is available publicly on the web.

A beginner who is looking for a career in web scraping can kickstart his/her journey by just reading this post. Web Scraping in Python is vast and learning it open gates to many opportunities.

Manthan Koolwal

My name Is Manthan Koolwal and I love to create web scrapers. I have been building them for the last 10 years now. I have created many seamless data pipelines for multiple MNCs now. Right now I am working on Scrapingdog, it's a web scraping API that can scrape any website without blockage at any scale. Feel free to contact me for any web scraping query. Happ Scraping!
Scrapingdog Logo

Try Scrapingdog for Free!

Free 1000 API calls of testing.

No credit card required!

DMCA.com Protection Status