
Getting Started with Selenium and Python

2020-05-18

Introduction to Selenium

Selenium is an open-source automated testing framework used to validate web applications across different browsers and platforms. It was created in 2004 by Jason Huggins, a software engineer at ThoughtWorks, who was tired of testing a web application manually over and over and wanted to automate the effort.
The Selenium API can control Firefox and Chrome through an external driver, and it has a much larger community than Puppeteer. The driver is an executable that relays commands from your script to a browser instance. Today Selenium is mainly used for web scraping and automation purposes.
Why should you use Selenium? Because it can automate browser interactions such as:
  • clicking on buttons
  • filling forms
  • scrolling
  • taking a screenshot
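The interactions above can be sketched as a small helper function. This is a minimal sketch: `driver` is assumed to be an already-started Selenium WebDriver, and the CSS selectors and filename are hypothetical placeholders for your target page.

```python
def exercise_page(driver, button_css, input_css, text):
    """Click a button, fill a form field, scroll down, and take a screenshot.

    `driver` is assumed to be a started Selenium WebDriver; the CSS
    selectors here are hypothetical placeholders.
    """
    driver.find_element("css selector", button_css).click()         # clicking on buttons
    driver.find_element("css selector", input_css).send_keys(text)  # filling forms
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scrolling
    driver.save_screenshot("page.png")                              # taking a screenshot
```

The string `"css selector"` is the locator strategy that `By.CSS_SELECTOR` expands to, so this works without extra imports.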

Requirements

Generally, web scraping is divided into two parts:
  1. Fetching data by making an HTTP request
  2. Extracting important data by parsing the HTML DOM
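As a toy illustration of step 2, here is Beautiful Soup pulling data out of a hardcoded HTML snippet. The snippet is a made-up stand-in for what step 1 (the HTTP request, or the browser) would return:

```python
from bs4 import BeautifulSoup

# Hardcoded HTML standing in for the response fetched in step 1.
html = '<div class="title">A Python Book</div><span class="price">$19.99</span>'

soup = BeautifulSoup(html, "html.parser")
title = soup.find("div", {"class": "title"}).text   # "A Python Book"
price = soup.find("span", {"class": "price"}).text  # "$19.99"
```

The same two calls, `find` and `find_all`, do almost all the work in the scraper we build below.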

Libraries & Tools

  1. Beautiful Soup is a Python library for pulling data out of HTML and XML files.
  2. Selenium is used to automate web browser interaction from Python.
  3. Chrome download page
  4. Chrome driver binary

Setup

Our setup is pretty simple. Just create a folder and install Beautiful Soup and Selenium by typing the commands given below. I am assuming that you have already installed Python 3.x.
mkdir scraper
pip install beautifulsoup4
pip install selenium

Quickstart

Once you have installed all the libraries, create a Python file inside the folder. I am using scraping.py. Then import all the libraries as shown below, including time, so we can let the page load completely.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

What we are going to scrape

We are going to scrape Python book titles and prices from Walmart's search results.

Preparing the Food

Now, since we have all the ingredients to prepare the scraper, we should make a GET request to the target URL from Walmart to get the raw HTML data.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome('F:/chromed/chromedriver', options=options)
url = "https://www.walmart.com/search/?query=python%20books"
driver.get(url)
A headless Chrome will be launched, controlled through the external chromedriver executable (adjust the chromedriver path for your machine).
Here are two interesting webdriver properties:
  • driver.stop_client Called after executing a quit command.
  • driver.name Returns the name of the underlying browser for this instance.
Now, to parse the raw HTML of the page, we will use BeautifulSoup.
time.sleep(4)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.close()

books = list()
k = {}
Now, I am letting it sleep for four seconds. The reason behind this is to let the page load completely. Then we will use BeautifulSoup to parse HTML. driver.page_source will return raw HTML from the website. I have also declared an empty list and dictionary to create a JSON object of the data we are going to scrape.
Python book title
After inspecting the title in Chrome Developer Tools, we can see that it is stored in a "div" tag with class "search-result-product-title listview".
Python book price
Similarly, the price is stored in a "span" tag with class "price display-inline-block arrange-fit price price-main". We also have to dive deeper inside this tag, to a span with class "visuallyhidden", to find the price in text format.
try:
    Title = soup.find_all("div", {"class": "search-result-product-title listview"})
except:
    Title = None

try:
    Price = soup.find_all("span", {"class": "price display-inline-block arrange-fit price price-main"})
except:
    Price = None
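You can verify this selector logic offline against a hand-written snippet that mimics the markup described above (the HTML here is a made-up stand-in, not the live Walmart page):

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking the structure of a Walmart search result.
sample = (
    '<div class="search-result-product-title listview">Product TitlePython 101</div>'
    '<span class="price display-inline-block arrange-fit price price-main">'
    '<span class="visuallyhidden">$9.99</span></span>'
)

soup = BeautifulSoup(sample, "html.parser")
titles = soup.find_all("div", {"class": "search-result-product-title listview"})
prices = soup.find_all("span", {"class": "price display-inline-block arrange-fit price price-main"})

print(titles[0].text)                                              # Product TitlePython 101
print(prices[0].find("span", {"class": "visuallyhidden"}).text)    # $9.99
```

Passing the full multi-class string works here because Beautiful Soup also matches the exact string value of the class attribute.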
We have all the titles and prices stored in a list format in variable Title and Price respectively.
We are going to start a for loop so that we can reach each and every book.
for i in range(0, len(Title)):
    try:
        k["Title{}".format(i+1)] = Title[i].text.replace("\n", "")
    except:
        k["Title{}".format(i+1)] = None
    try:
        k["Price{}".format(i+1)] = Price[i].find("span", {"class": "visuallyhidden"}).text.replace("\n", "")
    except:
        k["Price{}".format(i+1)] = None
    books.append(k)
    k = {}
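The nested output shown below can be produced by wrapping the list in a dictionary and pretty-printing it with the json module. A minimal sketch with dummy data (the real script would pass books instead):

```python
import json

# Dummy stand-ins for the scraped list; the real script would use `books`.
books = [
    {"Title1": "Product TitlePython 101", "Price1": "$9.99"},
    {"Title2": "Product TitlePython 102", "Price2": "$19.99"},
]

print(json.dumps({"PythonBooks": books}, indent=1))
```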
So, finally, we have all the titles and prices stored inside the list books. After printing it, we got:
{
  "PythonBooks": [
    {
      "Title1": "Product TitlePython : Advanced Predictive Analytics",
      "Price1": "$111.66"
    },
    {
      "Title2": "Product TitlePython",
      "Price2": "$6.99"
    },
    {
      "Title3": "Product TitlePython : Learn How to Write Codes-Your Perfect Step-By-Step Guide",
      "Price3": "$16.05"
    },
    {
      "Title4": "Product TitlePython: The Complete Beginner’s Guide",
      "Price4": "$14.99"
    },
    {
      "Price5": "$48.19",
      "Title5": "Product TitlePython : The Complete Reference"
    },
    {
      "Title6": "Product TitleThe Greedy Python : Book & CD",
      "Price6": "$10.55"
    },
    {
      "Price7": "$24.99",
      "Title7": "Product TitlePython: 2 Manuscripts in 1 Book: -Python for Beginners -Python 3 Guide (Paperback)"
    },
    {
      "Title8": "Product TitleBooks for Professionals by Professionals: Beginning Python Visualization: Crafting Visual Transformation Scripts (Paperback)",
      "Price8": "$67.24"
    },
    {
      "Title9": "Product TitlePython for Kids: A Playful Introduction to Programming (Paperback)",
      "Price9": "$23.97"
    },
    {
      "Price10": "$17.99",
      "Title10": "Product TitlePython All-In-One for Dummies (Paperback)"
    },
    {
      "Title11": "Product TitlePython Tutorial: Release 3.6.4 (Paperback)",
      "Price11": "$14.53"
    },
    {
      "Price12": "$13.58",
      "Title12": "Product TitleCoding for Kids: Python: Learn to Code with 50 Awesome Games and Activities (Paperback)"
    },
    {
      "Price13": "$56.10",
      "Title13": "Product TitlePython 3 Object Oriented Programming (Paperback)"
    },
    {
      "Title14": "Product TitleHead First Python: A Brain-Friendly Guide (Paperback)",
      "Price14": "$35.40"
    },
    {
      "Title15": "Product TitleMastering Object-Oriented Python — Second Edition (Paperback)",
      "Price15": "$44.99"
    },
    {
      "Title16": "Product TitlePocket Reference (O’Reilly): Python Pocket Reference: Python in Your Pocket (Paperback)",
      "Price16": "$13.44"
    },
    {
      "Title17": "Product TitleData Science with Python (Paperback)",
      "Price17": "$39.43"
    },
    {
      "Title18": "Product TitleHands-On Deep Learning Architectures with Python (Paperback)",
      "Price18": "$29.99"
    },
    {
      "Price19": "$37.73",
      "Title19": "Product TitleDjango for Beginners: Build websites with Python and Django (Paperback)"
    },
    {
      "Title20": "Product TitleProgramming Python: Powerful Object-Oriented Programming (Paperback)",
      "Price20": "$44.21"
    }
  ]
}
Similarly, you can scrape any JavaScript-enabled website using Selenium and Python. If you don't want to run these scrapers on your own server, you can try Scrapingdog, which is a proxy API for web scraping.

Conclusion

In this article, we saw how we can scrape data using Selenium and BeautifulSoup, regardless of the type of website. I hope you now feel more comfortable scraping web pages.
Feel free to comment and ask me anything. You can follow me on Twitter and Medium. Thanks for reading and please hit the like button! 👍

Additional Resources

Here are a few additional resources that you may find helpful during your web scraping journey:

Try Scrapingdog for Free!

1,000 free API calls for testing.

No credit card required!