
Web Scraping Hotel Prices & Other Data from booking.com using Python

23-11-2022

The hotel industry has grown continuously over the last 15 years, ever since the last recession. With that growth, competition in the industry has also increased.

New vendors enter the market every other day, which narrows profit margins for established vendors. As a result, it has become quite challenging for OTAs (online travel agencies) to keep up their booking revenue.


OTAs can overcome this problem by tracking their competitors’ pricing. The question is how to track it. This is where web scraping comes in: it lets you monitor competitors’ prices and use that information to improve your own revenue.

In this post, we are going to scrape booking.com with Python. By the end of this tutorial, you will be able to scrape the prices of any hotel on booking.com just by passing the check-in/check-out dates and the unique ID of the hotel.

Why use Python to Scrape booking.com

Python is a versatile language that is used extensively for web scraping, and it has dedicated libraries for the job.

It also has a large community, so you can usually find help when you run into trouble. If you are new to web scraping with Python, I would recommend going through this comprehensive guide to web scraping with it.

Requirements for scraping booking.com

We need Python 3.x for this tutorial, and I am assuming that you have already installed it on your computer. Along with that, you need to install two more libraries, which will be used later in this tutorial for web scraping.

  1. Requests will help us make an HTTP connection with booking.com.
  2. BeautifulSoup will help us to create an HTML tree for smooth data extraction.

Setup

First, create a folder and then install the libraries mentioned above.

mkdir booking
pip install requests 
pip install beautifulsoup4

Inside this folder, create a Python file where we will write the code. These are the data points that we are going to scrape from the target website.

  • Address
  • Name
  • Pricing
  • Rating
  • Room Type
  • Facilities

Let’s Scrape Booking.com

Since everything is set let’s make a GET request to the target website and see if it works.

import requests
from bs4 import BeautifulSoup

l=list()
o={}

headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"}

target_url = "https://www.booking.com/hotel/us/the-lenox.html?checkin=2022-12-28&checkout=2022-12-29&group_adults=2&group_children=0&no_rooms=1&selected_currency=USD"

resp = requests.get(target_url, headers=headers)

print(resp.status_code)

The code is fairly straightforward, but here is a brief explanation. First, we imported the two libraries that we installed earlier in this tutorial. Then we declared the headers and the target URL.

Finally, we made a GET request to the target URL. When you print the status code you should see 200. Any other code means the request did not succeed, either because something in the code is wrong or because the site refused the request.
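A status code other than 200 is easier to act on when it is turned into a clear error. The sketch below is one way to do that. The function name `ensure_ok` and its messages are illustrative and not part of Requests, and treating 403 and 429 as signs of blocking is an assumption rather than something booking.com documents.

```python
def ensure_ok(status_code, url):
    """Return None for a 200 response, otherwise raise with a hint."""
    if status_code == 200:
        return None
    if status_code in (403, 429):
        # 403 Forbidden / 429 Too Many Requests: assumed here to signal blocking
        raise RuntimeError(f"{url} refused the request ({status_code}); "
                           "the client may be blocked or rate limited")
    raise RuntimeError(f"{url} returned unexpected status {status_code}")


# Usage with the response object from the request above
# (commented out so the sketch runs offline):
# ensure_ok(resp.status_code, target_url)
```

Requests also provides `resp.raise_for_status()`, which raises `requests.HTTPError` for any 4xx or 5xx response.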

How to scrape the data points

Since we have already decided which data points we are going to scrape, let’s find their location in the HTML by inspecting the page in Chrome.

For this tutorial, we will be using the find() and find_all() methods of BeautifulSoup to find target elements. The DOM structure will decide which method is better for each element.

Extracting hotel name and address

Let’s inspect the page in Chrome and find the DOM location of the name as well as the address.

The hotel name can be found under the h2 tag with the class pp-header__title. For the sake of simplicity, let’s first create a soup variable with the BeautifulSoup constructor, and from that we will extract all the data points.

soup = BeautifulSoup(resp.text, 'html.parser')

Here BS4 uses an HTML parser to convert a complex HTML document into a tree of Python objects. Now, let’s use the soup variable to extract the name and address.

o["name"]=soup.find("h2",{"class":"pp-header__title"}).text

In a similar manner, we will extract the address.


The address of the property is stored under the span tag with the class name hp_address_subtitle.

o["address"]=soup.find("span",{"class":"hp_address_subtitle"}).text.strip("\n")
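`soup.find()` returns `None` when nothing matches, so chaining `.text` straight onto it raises `AttributeError` as soon as a class name on the page changes. A small guard, sketched here under the hypothetical name `text_or_none`, records `None` instead of crashing. The stand-in object below only mimics the `.text` attribute of a BeautifulSoup tag so that the example runs without a network call.

```python
from types import SimpleNamespace


def text_or_none(node):
    """Return the stripped text of a tag-like object, or None if it is missing."""
    if node is None:
        return None
    return node.text.strip()


# Stand-in for the result of soup.find(...): only the .text attribute matters here
fake_tag = SimpleNamespace(text="\nThe Lenox\n")
print(text_or_none(fake_tag))  # The Lenox
print(text_or_none(None))      # None

# With the real soup object this would read:
# o["name"] = text_or_none(soup.find("h2", {"class": "pp-header__title"}))
```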

Extracting rating and facilities

Once again we will inspect and find the DOM location of the rating and facilities element.


Rating is stored under the div tag with class d10a6220b4. We will use the same soup variable to extract this element. The following code will extract the rating data.

o["rating"]=soup.find("div",{"class":"d10a6220b4"}).text

Extracting facilities is a bit tricky. We will create a list that holds all the facility HTML elements. After that, we will run a for loop over those elements and store the text of each one in a second list.

Let’s see how it can be done in two simple steps.

fac=soup.find_all("div",{"class":"important_facility"})

The fac variable will hold all the facility elements. Now, let’s extract their text one by one.

fac_arr=[]
for i in range(0,len(fac)):
    fac_arr.append(fac[i].text.strip("\n"))

The fac_arr list will store the text values of all the elements. We have successfully managed to extract the main facilities.
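The raw facility strings may carry extra whitespace, and the same facility can show up more than once if the page repeats the block. Whether booking.com repeats it is an assumption, so treat the following as an optional clean-up step. `clean_facilities` is a hypothetical helper name and the sample strings are made up.

```python
def clean_facilities(raw_items):
    """Strip whitespace, drop empty strings, and remove duplicates in order."""
    stripped = (item.strip() for item in raw_items)
    # dict.fromkeys keeps the first occurrence of each key and preserves order
    return list(dict.fromkeys(item for item in stripped if item))


sample = ["\nFree WiFi\n", "\nFitness centre\n", "\nFree WiFi\n", "\n"]
print(clean_facilities(sample))  # ['Free WiFi', 'Fitness centre']
```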

Extract Price and Room Types

This is the trickiest part of the tutorial. The DOM structure of booking.com is a bit complex and needs careful study before extracting price and room type information.

Here the tbody tag contains all the data. Just below tbody you will find the tr tags; each tr tag holds the information for one row of the table.

One step further down, you will find multiple td tags, where information like room type, price, etc. can be found.

First, let’s find all the tr tags.

ids= list()

# find_all() returns an empty list (it does not raise) when nothing matches
tr = soup.find_all("tr")

One thing you will notice is that each of these tr tags has a data-block-id attribute. Let’s collect all those ids in a list, skipping any tr tag that lacks the attribute.

for row in tr:
    # get() returns None when the attribute is missing
    block_id = row.get('data-block-id')
    if block_id is not None:
        ids.append(block_id)
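The same ids can also be gathered with a CSS attribute selector, which only matches rows that carry the attribute. This is an alternative sketch rather than the approach used in the rest of the tutorial, and the HTML snippet is a made-up stand-in for the room table on booking.com.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the shape of the room table
sample_html = """
<table><tbody>
  <tr><th>Room type</th><th>Price</th></tr>
  <tr data-block-id="101_a"><td>King Room</td><td>$329</td></tr>
  <tr data-block-id="101_b"><td>$349</td></tr>
</tbody></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# "tr[data-block-id]" matches only the tr tags that carry the attribute
sample_ids = [row["data-block-id"] for row in sample_soup.select("tr[data-block-id]")]
print(sample_ids)  # ['101_a', '101_b']
```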

Once you have all the ids, the rest of the job becomes easier. We will iterate over every data-block-id to extract room pricing and room types from their individual tr blocks.

for i in range(0,len(ids)):
    # one dictionary per row
    k={}
    # find() returns None (it does not raise) if no row has this id
    allData = soup.find("tr",{"data-block-id":ids[i]})

The allData variable will store all the HTML for a particular data-block-id.

Now we can move to the td tags inside this tr tag. Let’s extract the room type first.

try:
    rooms = allData.find("span",{"class":"hprt-roomtype-icon-link"})
except AttributeError:
    # allData is None when no row matched the id
    rooms=None

Here comes the fun part. When one room type has more than one pricing option, the room name appears only in the first row, so you have to reuse it for the remaining prices in the loop.

For example, a room type may be listed with three prices. On the first iteration rooms holds the room name, but on the next two iterations it will be None, which you can confirm by printing it. So we keep using the previous room name until we receive a new one.

if rooms is not None:
    last_room = rooms.text.replace("\n","")
# rows without their own room name reuse the last one seen
k["room"]=last_room

Here last_room will store the last value of rooms until we receive a new value.
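The carry-forward idea is easier to see on plain data. In this sketch each tuple stands for one table row as `(room_name_or_None, price_text)`. The function name `fill_room_names`, the rows, the room names, and the prices are all invented for illustration.

```python
def fill_room_names(rows):
    """Give every row a room name by reusing the last name that was seen."""
    filled = []
    last_room = None
    for room, price in rows:
        if room is not None:
            last_room = room  # a new room type starts here
        filled.append({"room": last_room, "price": price})
    return filled


rows = [
    ("King Room", "$329"),
    (None, "$349"),   # same room type, different rate
    (None, "$379"),
    ("Queen Room", "$299"),
]
for entry in fill_room_names(rows):
    print(entry)
```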

Let’s extract the price now.


Price is stored under the div tag with class “bui-price-display__value prco-text-nowrap-helper prco-inline-block-maker-helper prco-f-font-heading”. Let’s use allData variable to find it and extract the text.

price = allData.find("div",{"class":"bui-price-display__value prco-text-nowrap-helper prco-inline-block-maker-helper prco-f-font-heading"})

k["price"]=price.text.replace("\n","")
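The scraped price is a string, which is awkward to compare or sort. A minimal sketch for turning it into a number follows. It assumes USD-style formatting (comma as thousands separator, dot as decimal point), which fits the `selected_currency=USD` parameter in the URL but would not suit every locale. `parse_price` is a hypothetical helper and the sample strings are made up.

```python
import re


def parse_price(text):
    """Extract a float from strings such as 'US$1,329' or '$329.50'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


print(parse_price("\nUS$1,329\n"))  # 1329.0
print(parse_price("$329.50"))       # 329.5
print(parse_price("Sold out"))      # None
```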

We have finally managed to scrape all the data elements that we were interested in.

Complete Code

You can extract other pieces of information, like amenities and reviews, with a few more changes. You can also extract the details of other hotels just by changing the unique name of the hotel in the URL.

Putting it all together, the code looks like this.

import requests
from bs4 import BeautifulSoup

l=list()
g=list()
o={}
k={}
fac=[]
fac_arr=[]
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"}

target_url = "https://www.booking.com/hotel/us/the-lenox.html?checkin=2022-12-28&checkout=2022-12-29&group_adults=2&group_children=0&no_rooms=1&selected_currency=USD"

resp = requests.get(target_url, headers=headers)

soup = BeautifulSoup(resp.text, 'html.parser')

o["name"]=soup.find("h2",{"class":"pp-header__title"}).text
o["address"]=soup.find("span",{"class":"hp_address_subtitle"}).text.strip("\n")
o["rating"]=soup.find("div",{"class":"d10a6220b4"}).text

fac=soup.find_all("div",{"class":"important_facility"})
for i in range(0,len(fac)):
    fac_arr.append(fac[i].text.strip("\n"))


ids= list()

# find_all() returns an empty list when nothing matches
tr = soup.find_all("tr")

for row in tr:
    # get() returns None when the attribute is missing
    block_id = row.get('data-block-id')
    if block_id is not None:
        ids.append(block_id)

print("ids are ",len(ids))


last_room = None
for block_id in ids:
    allData = soup.find("tr",{"data-block-id":block_id})
    if allData is None:
        continue

    rooms = allData.find("span",{"class":"hprt-roomtype-icon-link"})
    # rows without their own room name reuse the last one seen
    if rooms is not None:
        last_room = rooms.text.replace("\n","")

    price = allData.find("div",{"class":"bui-price-display__value prco-text-nowrap-helper prco-inline-block-maker-helper prco-f-font-heading"})
    if price is None:
        # skip rows that do not show a price
        continue

    k = {"room": last_room, "price": price.text.replace("\n","")}
    g.append(k)


l.append(g)
l.append(o)
l.append(fac_arr)
print(l)

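The script prints a list with three items: the list of room and price entries, the dictionary holding the hotel name, address, and rating, and the list of facilities.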
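Since only the URL changes between hotels and dates, building it from parameters makes the script reusable. The sketch below assumes the query parameters seen in `target_url` are all that is needed. `build_booking_url` and its argument names are hypothetical, and `hotel_path` is the country code plus hotel name taken from the page URL.

```python
from urllib.parse import urlencode


def build_booking_url(hotel_path, checkin, checkout,
                      adults=2, children=0, rooms=1, currency="USD"):
    """Build a hotel page URL; dates are strings in YYYY-MM-DD format."""
    params = {
        "checkin": checkin,
        "checkout": checkout,
        "group_adults": adults,
        "group_children": children,
        "no_rooms": rooms,
        "selected_currency": currency,
    }
    return f"https://www.booking.com/hotel/{hotel_path}.html?{urlencode(params)}"


print(build_booking_url("us/the-lenox", "2022-12-28", "2022-12-29"))
```

With the arguments shown, the function reproduces the `target_url` used throughout this tutorial.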

Advantages of Scraping Booking.com

Many travel agencies collect a large amount of data from their competitors’ websites. They know that to gain an edge in the market, they need insight into competitors’ pricing strategies.


To gain an advantage over niche competitors, you have to scrape multiple websites and aggregate the data. You can then adjust your own prices after comparing them with the competition, offer discounts, or show on your platform how your prices compare with your competitors’ prices.
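As a small illustration of the comparison step, the sketch below takes your own nightly rate and a mapping of competitor rates, then reports the cheapest competitor and the gap. All site names and numbers are invented, and `compare_prices` is a hypothetical helper rather than a pricing strategy.

```python
def compare_prices(own_price, competitor_prices):
    """Return the cheapest competitor and how far our price sits above it."""
    cheapest_site = min(competitor_prices, key=competitor_prices.get)
    cheapest_price = competitor_prices[cheapest_site]
    return {
        "cheapest_site": cheapest_site,
        "cheapest_price": cheapest_price,
        # positive gap: we are more expensive; negative gap: we are cheaper
        "gap": round(own_price - cheapest_price, 2),
    }


competitors = {"site_a": 329.0, "site_b": 315.5, "site_c": 342.0}
print(compare_prices(320.0, competitors))
```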

Since there are more than 200 OTAs in the market, scraping and comparing them all becomes much more difficult. I would advise you to use a service such as a hotel search API to get the prices of hotels in any city around the globe.

Conclusion

Hotel data scraping goes well beyond this; this was just an example of how Python can be used to scrape Booking.com for price comparison. You can also use Python to scrape other websites like Expedia, Hotels.com, etc.

However, scraping at scale is not practical with this process. After some time, booking.com will likely block your IP and your data pipeline will stop working. For uninterrupted scraping, use a Web Scraping API that rotates IPs on every new request and uses headless Chrome to reduce the chance of being blocked.


Manthan Koolwal

My name is Manthan Koolwal and I love to create web scrapers. I have been building them for the last 10 years and have created many data pipelines for multiple MNCs. Right now I am working on Scrapingdog, a web scraping API that can scrape any website without blockage at any scale. Feel free to contact me with any web scraping query. Happy Scraping!