
How to Web Scrape Google Search Results Using Python and BeautifulSoup

17-03-2023

In today’s blog, we’ll dive into web scraping Google Search results using Python and BeautifulSoup to extract valuable information. We will build a Google Search scraper of our own that automates the process of pulling URLs, data, and insights.

As we move forward, you will learn how to effectively scrape URLs from Google search results, gaining the ability to gather large amounts of data quickly and efficiently. Get ready as we unfold the steps to extract data from Google search results, transforming the vast ocean of information available into a structured database for your use.

There are several use cases for scraping Google search results. Here are some of them:

  1. Analyzing Google’s algorithm and identifying its main trends.
  2. Gaining insights for search engine optimization (SEO): monitoring how your website performs in Google for specific queries over time.
  3. Analyzing ad rankings for a given set of keywords.
  4. Building SEO tools: such tools scrape Google search results to report the average volume of keywords, their difficulty scores, and other metrics.

In this blog post, we’ll look at the Python libraries that make this process simple, and then we will scrape Google search results step by step.


Why Python for Scraping Google Search Results?

Python is a widely used, simple language with a rich ecosystem of libraries, and it remains one of the most in-demand skills in data science. It is also flexible and easy to understand, even if you are a beginner. The Python community is very large, which helps whenever you face an error while coding.

Forums like StackOverflow and GitHub already have answers to most of the errors you might face while coding your Google search scraper.

You can do countless things with Python but for now, we will learn web scraping Google search results with it.

Read More: Web scraping 101 with Python (A beginner-friendly tutorial)

Let’s Start Scraping Google Search Results with Python

In this section, we will web scrape Google search results for any specific country using Python and a free residential proxy. But first, we will focus on creating a basic Python script: a Google search result scraper that can extract data from the first 10 Google results.

The end result will be JSON data consisting of each result’s link, title, description, and position.

Prerequisites for Scraping Google Search Results Using Python

Generally, web scraping with Python is divided into two parts:

  1. Fetching data by making an HTTP request.
  2. Extracting essential data by parsing the HTML DOM.

Libraries & Tools

  1. Beautiful Soup is a Python library for pulling data out of HTML and XML files.
  2. Requests allow you to send HTTP requests very easily.
  3. A residential proxy to fetch the HTML of the target URL from different locations.

Setup

Our setup is pretty simple. Just create a folder and install Beautiful Soup and Requests. To create the folder and install the libraries, run the commands below. I am assuming that you have already installed Python 3.x.

mkdir scraper
pip install beautifulsoup4
pip install requests

Now, create a file inside that folder by any name you like. I am using google.py.

Import the libraries we just installed in that file.

from bs4 import BeautifulSoup
import requests

Preparing the Food

Now that we have all the ingredients to prepare the scraper, we will make a GET request to the target URL to get the raw HTML data. We will fetch the Google Search results using the requests library as shown below.

We will first try to extract data from the first 10 search results and then we will focus on country-specific results.

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}

url = 'https://www.google.com/search?q=pizza&ie=utf-8&oe=utf-8&num=10'
html = requests.get(url, headers=headers)

This will give you the raw HTML of the target URL. Now, you have to use BeautifulSoup to parse that HTML.

soup = BeautifulSoup(html.text, 'html.parser')

When you inspect the Google page, you will find that all the results come under a class “g”. Of course, this class name will change after some time because Google doesn’t like scrapers, so you have to keep it in check.


We will extract all the classes with the name “g”.

allData = soup.find_all("div",{"class":"g"})
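
Because that class name changes periodically, a quick sanity check right after the find_all call can save you debugging time. A minimal sketch (the error message wording is my own):

if not allData:
    # An empty result set usually means Google changed its markup
    # (or served a block page); re-inspect the HTML before parsing.
    raise RuntimeError("No 'div.g' results found in the page")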

Now, we will run a for loop over every item in the allData list.

g = 0
Data = []

for result in allData:
    anchor = result.find('a')
    link = anchor.get('href') if anchor else None

    # Keep only organic links: they start with https, while ad links
    # contain 'aclk' somewhere in the URL string.
    if link is None or not link.startswith('https') or link.find('aclk') != -1:
        continue

    g = g + 1
    l = {}
    l["link"] = link

    try:
        l["title"] = result.find('h3').text
    except AttributeError:
        l["title"] = None

    try:
        l["description"] = result.find("span", {"class": "aCOpRe"}).text
    except AttributeError:
        l["description"] = None

    l["position"] = g
    Data.append(l)

print(Data)

Inside the for loop, we extract the website link, title, and description. The link lives in the href attribute of the a tag, the title in the h3 tag, and the description in a span tag with the class aCOpRe.


We have to filter the legitimate Google links out of the raw data, so we check that each URL starts with https and discard the garbage. Ad links can be filtered out simply by checking whether the URL string contains ‘aclk’. We then collect all the fields in a dictionary l and append it to the list Data.

On printing the list Data, you get a list of dictionaries, one per search result.
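A representative shape of the output (the values below are made up for illustration):

[
  {
    'link': 'https://www.example.com/pizza',
    'title': 'Example Pizza Page',
    'description': 'Example snippet text shown by Google...',
    'position': 1
  },
  ...
]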

This method is not reliable, though, because Google will block you after a certain number of requests. We need more advanced tools to overcome this problem.
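
Until then, one simple mitigation is to pause between requests so the traffic pattern looks less robotic. A minimal sketch; the query list and the 2-6 second delay range are arbitrary illustrations, not tested thresholds:

import time
import requests
from random import uniform

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}

# Hypothetical batch of queries, used only to illustrate the pacing.
for query in ['pizza', 'pasta', 'burger']:
    url = 'https://www.google.com/search?q=' + query + '&num=10'
    html = requests.get(url, headers=headers)
    # Sleep a random 2-6 seconds between requests.
    time.sleep(uniform(2, 6))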

Know more: 10 tips to avoid getting blocked while scraping the web!!

Scraping Google Search Results from Different Countries

Now that we know how to web scrape Google search results using Python and BeautifulSoup (as we did in the previous section), we can move on to more advanced techniques, because Google shows different results in different countries for the same keyword.

So, we will now scrape the Google results according to the country of origin. We will use a residential proxy to achieve this. There are plenty of web scraping tools out there that you can use for this task.

First, we will create a list of user agents so that we can rotate them on every request. For this tutorial, we will create a list of 10 user agents. If you want more, then you can find them here.

userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.83 Safari/537.1',
]

Now, we need a residential proxy provider through which we can rotate proxies and change the origin of each request. When you sign up to Scrapingdog, you get 1,000 free requests. You can find the proxy documentation here.

You will find your proxy URL on the dashboard. We will create a proxy object to pass it on to the requests method.

# Replace YOUR_API_KEY with the key from your Scrapingdog dashboard;
# the exact proxy URL and credential format are shown there as well.
http_proxy  = "http://scrapingdog:YOUR_API_KEY-country=us@proxy.scrapingdog.com:8081"
https_proxy = "http://scrapingdog:YOUR_API_KEY-country=us@proxy.scrapingdog.com:8081"

proxyDict = {"http": http_proxy, "https": https_proxy}

We have used -country=us as a parameter in the proxy credentials to route requests through USA proxies. Similarly, you can use ‘ca’ for Canada, ‘gb’ for the United Kingdom, ‘in’ for India, etc.
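
If you plan to query several countries, it helps to build the proxy dictionary from a country code instead of hand-editing the URL. A small sketch, assuming the same credential format as above (build_proxies and the YOUR_API_KEY placeholder are my own names, not part of Scrapingdog’s API):

def build_proxies(country_code):
    # Inject the two-letter country code (e.g. 'us', 'ca', 'gb', 'in')
    # into the proxy credentials shown above.
    proxy_url = ("http://scrapingdog:YOUR_API_KEY-country="
                 + country_code + "@proxy.scrapingdog.com:8081")
    return {"http": proxy_url, "https": proxy_url}

proxyDict = build_proxies("gb")  # route the request through the UK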

We will use the random library to rotate user agents.

from random import randrange

headers = {'User-Agent': userAgents[randrange(len(userAgents))]}
html = requests.get(url, proxies=proxyDict, headers=headers)

And that’s it. All the rest of the code will remain the same as earlier.

As earlier, we will create a BeautifulSoup object and extract the same classes. But this time, Google is far less likely to block you, because every request arrives from a new IP.
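
Putting the pieces together, the full country-aware scraper can look roughly like this. It is a sketch under the assumptions above: it reuses the userAgents list defined earlier, and scrape_google and the YOUR_API_KEY placeholder are my own names rather than anything from Scrapingdog:

import requests
from bs4 import BeautifulSoup
from random import randrange

def scrape_google(query, country_code):
    # Rotate user agents from the userAgents list defined earlier.
    headers = {'User-Agent': userAgents[randrange(len(userAgents))]}
    proxy_url = ("http://scrapingdog:YOUR_API_KEY-country="
                 + country_code + "@proxy.scrapingdog.com:8081")
    proxies = {"http": proxy_url, "https": proxy_url}

    url = 'https://www.google.com/search?q=' + query + '&num=10'
    html = requests.get(url, headers=headers, proxies=proxies)
    soup = BeautifulSoup(html.text, 'html.parser')

    data, position = [], 0
    for result in soup.find_all("div", {"class": "g"}):
        anchor = result.find('a')
        link = anchor.get('href') if anchor else None
        # Same organic-link filter as in the basic scraper above.
        if link is None or not link.startswith('https') or 'aclk' in link:
            continue
        position += 1
        title = result.find('h3')
        description = result.find("span", {"class": "aCOpRe"})
        data.append({
            "link": link,
            "title": title.text if title else None,
            "description": description.text if description else None,
            "position": position,
        })
    return data

print(scrape_google('pizza', 'us'))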

For the USA, you will see the results American users get; run the same script with ‘gb’ and the United Kingdom returns a noticeably different ranking for the same keyword. Similarly, you can check for other countries.

Limitations of scraping Google search results with Python

Although Python is an excellent language for web scraping Google search results, it has some limitations. Because it is dynamically typed, many mistakes only surface as runtime errors, and because of the Global Interpreter Lock it cannot run CPU-bound threads in parallel the way some other languages can.

Further, a single-threaded Python scraper tends to be slow when you need results at high volume.

Beyond that, you cannot keep scraping Google at a large scale with just Python, because Google will ultimately block your script when it sees a large amount of traffic coming from one single IP.

With Scrapingdog’s Google Scraper API, you don’t have to maintain a web scraping script: Scrapingdog handles all the hassle and seamlessly delivers the data. You can take a trial, where the first 1,000 requests are on us.


Using Google’s API to Scrape Google Search Results

Google offers its own API to extract data from its search engine, and anyone can sign up to use it. However, its usefulness is very limited, for the following reasons:

  • The API is very costly: every 1,000 requests cost around $5, which doesn’t make sense when web scraping tools can do the same job for free.
  • The API has limited functionality: it is designed to search only a small group of websites, and although you can reconfigure it to cover the whole web, doing so costs you time.
  • Limited information: the API returns only a small slice of the data shown on a results page, so what you extract may not be useful.

Conclusion

In this article, we learned how to web scrape data from Google search results using Python and BeautifulSoup, and further, we used Scrapingdog’s residential proxy to get country-specific results.

Thanks for reading!!

Frequently Asked Questions

How do I use the Google Scraper API?

It is easy to use the Google Scraper API. For step-by-step instructions, you can check out this documentation.


Manthan Koolwal

My name is Manthan Koolwal and I am the founder of scrapingdog.com. I love creating scrapers and seamless data pipelines.