
Getting Started with Selenium and Python

2020-05-18

Introduction to Selenium

Selenium is an open-source automated testing framework used to validate web applications across different browsers and platforms. It was created in 2004 by Jason Huggins, a software engineer at ThoughtWorks, who was tired of testing a web application manually over and over and wanted to automate the effort.
The Selenium API can control Firefox and Chrome through an external driver, and it has a much larger community than Puppeteer. The driver is an executable that relays commands from your script to a browser instance. Today Selenium is mainly used for web scraping and automation purposes.
Why should you use Selenium? Because it can automate browser interactions such as:
  • clicking on buttons
  • filling forms
  • scrolling
  • taking a screenshot
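The interactions above can be sketched as a small helper function. This is a minimal sketch: `driver` is assumed to be an already-started Selenium WebDriver, and the CSS selectors and filename are hypothetical placeholders for your target page.

```python
def exercise_page(driver, button_css, input_css, text):
    """Click a button, fill a form field, scroll down, and take a screenshot.

    `driver` is assumed to be a started Selenium WebDriver; the CSS
    selectors here are hypothetical placeholders.
    """
    driver.find_element("css selector", button_css).click()         # clicking on buttons
    driver.find_element("css selector", input_css).send_keys(text)  # filling forms
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scrolling
    driver.save_screenshot("page.png")                              # taking a screenshot
```

The string `"css selector"` is the locator strategy that `By.CSS_SELECTOR` expands to, so this works without extra imports.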

Requirements

Generally, web scraping is divided into two parts:
  1. Fetching data by making an HTTP request
  2. Extracting important data by parsing the HTML DOM
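As a toy illustration of step 2, here is Beautiful Soup pulling data out of a hardcoded HTML snippet. The snippet is a made-up stand-in for what step 1 (the HTTP request, or the browser) would return:

```python
from bs4 import BeautifulSoup

# Hardcoded HTML standing in for the response fetched in step 1.
html = '<div class="title">A Python Book</div><span class="price">$19.99</span>'

soup = BeautifulSoup(html, "html.parser")
title = soup.find("div", {"class": "title"}).text   # "A Python Book"
price = soup.find("span", {"class": "price"}).text  # "$19.99"
```

The same two calls, `find` and `find_all`, do almost all the work in the scraper we build below.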

Libraries & Tools

  1. Beautiful Soup is a Python library for pulling data out of HTML and XML files.
  2. Selenium is used to automate web browser interaction from Python.
  3. Chrome download page
  4. Chrome driver binary

Setup

Our setup is pretty simple. Just create a folder and install Beautiful Soup and Selenium by typing the commands given below. I am assuming that you have already installed Python 3.x.
mkdir scraper
pip install beautifulsoup4
pip install selenium

Quickstart

Once you have installed all the libraries, create a Python file inside the folder. I am using scraping.py. Then import all the libraries as shown below, including time, so we can let the page load completely.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

What we are going to scrape

We are going to scrape Python book titles and prices from Walmart's search results.

Preparing the Food

Now, since we have all the ingredients to prepare the scraper, we should make a GET request to the target URL from Walmart to get the raw HTML data.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome('F:/chromed/chromedriver', options=options)
url = "https://www.walmart.com/search/?query=python%20books"
driver.get(url)
A headless Chrome will be launched, controlled through the external chromedriver executable (adjust the chromedriver path for your machine).
Here are two interesting webdriver properties:
  • driver.stop_client Called after executing a quit command.
  • driver.name Returns the name of the underlying browser for this instance.
Now, to parse the raw HTML of the page, we will use BeautifulSoup.
time.sleep(4)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.close()

books = list()
k = {}
Now, I am letting it sleep for four seconds. The reason behind this is to let the page load completely. Then we will use BeautifulSoup to parse HTML. driver.page_source will return raw HTML from the website. I have also declared an empty list and dictionary to create a JSON object of the data we are going to scrape.
Python book title
After inspecting the title in Chrome Developer Tools, we can see that it is stored in a "div" tag with class "search-result-product-title listview".
Python book price
Similarly, the price is stored in a "span" tag with class "price display-inline-block arrange-fit price price-main". We also have to dive deeper inside this tag, to a span with class "visuallyhidden", to find the price in text format.
try:
    Title = soup.find_all("div", {"class": "search-result-product-title listview"})
except:
    Title = None

try:
    Price = soup.find_all("span", {"class": "price display-inline-block arrange-fit price price-main"})
except:
    Price = None
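You can verify this selector logic offline against a hand-written snippet that mimics the markup described above (the HTML here is a made-up stand-in, not the live Walmart page):

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking the structure of a Walmart search result.
sample = (
    '<div class="search-result-product-title listview">Product TitlePython 101</div>'
    '<span class="price display-inline-block arrange-fit price price-main">'
    '<span class="visuallyhidden">$9.99</span></span>'
)

soup = BeautifulSoup(sample, "html.parser")
titles = soup.find_all("div", {"class": "search-result-product-title listview"})
prices = soup.find_all("span", {"class": "price display-inline-block arrange-fit price price-main"})

print(titles[0].text)                                              # Product TitlePython 101
print(prices[0].find("span", {"class": "visuallyhidden"}).text)    # $9.99
```

Passing the full multi-class string works here because Beautiful Soup also matches the exact string value of the class attribute.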
We have all the titles and prices stored in a list format in variable Title and Price respectively.
We are going to start a for loop so that we can reach each and every book.
for i in range(0, len(Title)):
    try:
        k["Title{}".format(i+1)] = Title[i].text.replace("\n", "")
    except:
        k["Title{}".format(i+1)] = None
    try:
        k["Price{}".format(i+1)] = Price[i].find("span", {"class": "visuallyhidden"}).text.replace("\n", "")
    except:
        k["Price{}".format(i+1)] = None
    books.append(k)
    k = {}
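The nested output shown below can be produced by wrapping the list in a dictionary and pretty-printing it with the json module. A minimal sketch with dummy data (the real script would pass books instead):

```python
import json

# Dummy stand-ins for the scraped list; the real script would use `books`.
books = [
    {"Title1": "Product TitlePython 101", "Price1": "$9.99"},
    {"Title2": "Product TitlePython 102", "Price2": "$19.99"},
]

print(json.dumps({"PythonBooks": books}, indent=1))
```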
So, finally, we have all the titles and prices stored inside the list books. After printing it, we got:
{
  "PythonBooks": [
    {
      "Title1": "Product TitlePython : Advanced Predictive Analytics",
      "Price1": "$111.66"
    },
    {
      "Title2": "Product TitlePython",
      "Price2": "$6.99"
    },
    {
      "Title3": "Product TitlePython : Learn How to Write Codes-Your Perfect Step-By-Step Guide",
      "Price3": "$16.05"
    },
    {
      "Title4": "Product TitlePython: The Complete Beginner’s Guide",
      "Price4": "$14.99"
    },
    {
      "Price5": "$48.19",
      "Title5": "Product TitlePython : The Complete Reference"
    },
    {
      "Title6": "Product TitleThe Greedy Python : Book & CD",
      "Price6": "$10.55"
    },
    {
      "Price7": "$24.99",
      "Title7": "Product TitlePython: 2 Manuscripts in 1 Book: -Python for Beginners -Python 3 Guide (Paperback)"
    },
    {
      "Title8": "Product TitleBooks for Professionals by Professionals: Beginning Python Visualization: Crafting Visual Transformation Scripts (Paperback)",
      "Price8": "$67.24"
    },
    {
      "Title9": "Product TitlePython for Kids: A Playful Introduction to Programming (Paperback)",
      "Price9": "$23.97"
    },
    {
      "Price10": "$17.99",
      "Title10": "Product TitlePython All-In-One for Dummies (Paperback)"
    },
    {
      "Title11": "Product TitlePython Tutorial: Release 3.6.4 (Paperback)",
      "Price11": "$14.53"
    },
    {
      "Price12": "$13.58",
      "Title12": "Product TitleCoding for Kids: Python: Learn to Code with 50 Awesome Games and Activities (Paperback)"
    },
    {
      "Price13": "$56.10",
      "Title13": "Product TitlePython 3 Object Oriented Programming (Paperback)"
    },
    {
      "Title14": "Product TitleHead First Python: A Brain-Friendly Guide (Paperback)",
      "Price14": "$35.40"
    },
    {
      "Title15": "Product TitleMastering Object-Oriented Python — Second Edition (Paperback)",
      "Price15": "$44.99"
    },
    {
      "Title16": "Product TitlePocket Reference (O’Reilly): Python Pocket Reference: Python in Your Pocket (Paperback)",
      "Price16": "$13.44"
    },
    {
      "Title17": "Product TitleData Science with Python (Paperback)",
      "Price17": "$39.43"
    },
    {
      "Title18": "Product TitleHands-On Deep Learning Architectures with Python (Paperback)",
      "Price18": "$29.99"
    },
    {
      "Price19": "$37.73",
      "Title19": "Product TitleDjango for Beginners: Build websites with Python and Django (Paperback)"
    },
    {
      "Title20": "Product TitleProgramming Python: Powerful Object-Oriented Programming (Paperback)",
      "Price20": "$44.21"
    }
  ]
}
Similarly, you can scrape any JavaScript-enabled website using Selenium and Python. If you don't want to run these scrapers on your own server, you can try Scrapingdog, which is a proxy API for web scraping.

Conclusion

In this article, we saw how we can scrape data using Selenium and BeautifulSoup, regardless of the type of website. I hope you now feel more comfortable scraping web pages.
Feel free to comment and ask me anything. You can follow me on Twitter and Medium. Thanks for reading and please hit the like button! 👍

Additional Resources

Here are a few additional resources that you may find helpful during your web scraping journey:

Try Scrapingdog for Free!

1,000 free API calls for testing.

No credit card required!