
Web Scraping Amazon Reviews using Python

01-05-2024

Understanding customer sentiment through product reviews has become a critical factor for success. Among the various platforms, Amazon stands out for its extensive range of products and the wealth of reviews that accompany each listing.

Web scraping Amazon reviews can yield valuable feedback, enabling detailed sentiment analysis of products within the same category. By extracting and analyzing these insights, businesses and developers can shape their offerings to better meet consumer needs and preferences.

Scraping Amazon Reviews

In this blog, we will scrape the reviews of a product on Amazon using Python. We have already prepared a guide on scraping Amazon data using Python, and we also provide a dedicated API for Scraping Amazon Product Data.

Let’s get started!

Requirements

Scraping websites such as Amazon requires a headless browser to automate actions such as clicking, scrolling, and locating specific elements on the page. In Python, there are two major browser automation libraries: Selenium and Playwright. While the former is powerful and widely used, Playwright has the edge in scenarios such as the following:

  1. Better browser support: Playwright supports all modern browsers, including Chrome, Firefox, WebKit, and Edge, whereas Selenium requires additional driver setup.
  2. Easy to use: Playwright has a more intuitive API and is available across several programming languages, making it easier for non-Python developers to get started.
  3. Better infrastructure: Playwright integrates seamlessly with modern web features such as WebSockets and does not require a separate driver process.

Read More: Web Scraping with Selenium

In addition, Playwright is generally faster and has extensive, well-written documentation, making it a sensible choice of headless browser tooling. To install the library in your virtual environment, run the following:

pip install pytest-playwright
playwright install

If an error occurs, it is most likely caused by missing system dependencies, which you can install with the following command:

playwright install-deps
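
To check that the installation worked, you can run a minimal script that launches Chromium and prints a page title. This is just a sanity check using Playwright's synchronous API (the rest of this article uses the asynchronous one), and example.com is only a placeholder URL:

# quick smoke test for the Playwright installation (synchronous API)
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())  # should print "Example Domain"
    browser.close()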

When it comes to HTML parsing, the two most popular options are Scrapy and Beautiful Soup. The former is widely used for large-scale web scraping infrastructures, but it is harder to set up and requires good knowledge of the framework. The latter is easier to work with and is a good choice for those starting out with web scraping. You can install Beautiful Soup with the following pip command:

pip install beautifulsoup4
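
If you are new to Beautiful Soup, the short sketch below shows the parsing pattern we will rely on later: build a soup object from an HTML string and collect every element with a given class. The HTML here is made up purely for illustration:

from bs4 import BeautifulSoup

# made-up HTML that mimics the structure we will scrape later
html = '<div class="review">Great product!</div><div class="review">Broke after a week.</div>'

soup = BeautifulSoup(html, 'html.parser')

# find_all returns every element matching the tag and class
for review in soup.find_all('div', class_='review'):
    print(review.text)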

How to Scrape Amazon Reviews

We first need to specify the libraries to be imported at the beginning of the script:

from playwright.async_api import async_playwright
import asyncio
import time
import json
from bs4 import BeautifulSoup

From the Playwright package, we import async_playwright from the async_api module, since we’ll be using Playwright’s asynchronous API and awaiting each operation before the next one runs. We therefore also need the asyncio library to run our main function, which will be a coroutine (an async function). The time package is used to wait on specific pages before the script continues. Finally, we import the json package to save the results as JSON, and the Beautiful Soup library to parse the HTML.
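
For readers new to asyncio, the pattern is minimal: define a coroutine with async def and hand it to asyncio.run(), which starts an event loop and executes it. A tiny standalone illustration:

import asyncio

async def greet():
    # any awaited work (page navigation, waits, etc.) would go here
    print('hello from a coroutine')

asyncio.run(greet())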

Let’s take a look at the first part of the script:

async def main(page_num):

    async with async_playwright() as pw:

        # creates an instance of the Chromium browser and launches it
        browser = await pw.chromium.launch(headless=False)

        # creates a new browser page (tab) within the browser instance
        page = await browser.new_page()

        # save all results in this list
        all_reviews = []

        # pagination
        while page_num < 200:

            # URL to crawl
            URL = ("https://www.amazon.com/"
                   "Tempered-Protector-Cuteey-Leopord-Accessories"
                   "/product-reviews/B08Z6WND9D/"
                   "ref=cm_cr_getr_d_paging_btm_2?ie=UTF8&"
                   f"pageNumber={page_num}"
                   "&reviewerType=all_reviews&pageSize=10")

            # go to url with Playwright page element
            await page.goto(URL, timeout=100000)

            time.sleep(2)

            # Extract section with the items
            section = await page.inner_html('div.reviews-content')

We start by defining an asynchronous function by placing async before def. The function takes the page number as input. We enter the async_playwright() context manager as pw and use it to launch an instance of the Chromium browser. By setting headless to False we can watch the browser’s interactions in real time.

We then create a page object to perform actions on the website, and a list to store all the scraped reviews. The while loop handles pagination. In this example, we use 200 as the maximum number of pages to scrape; however, as we’ll see later, the loop breaks as soon as there are no more reviews to crawl.

[Screenshot: Amazon product reviews page]

The target URL is the Amazon page containing the reviews of the product we want to scrape. You can point the script at whichever product you like, but the pageNumber query parameter is needed to loop through several pages and ensure all reviews are extracted.
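
If you want to reuse the script for other products, it can help to build the review URL from the product’s ASIN and the page number. The helper below is a hypothetical convenience function, not part of the original script; its query parameters simply mirror the URL used above:

def build_review_url(asin, page_num):
    # hypothetical helper: assembles a review-page URL for a given ASIN and page number
    return ("https://www.amazon.com/product-reviews/"
            f"{asin}/ref=cm_cr_getr_d_paging_btm_{page_num}"
            f"?ie=UTF8&pageNumber={page_num}"
            "&reviewerType=all_reviews&pageSize=10")

# example: page 3 of reviews for the product used in this article
print(build_review_url("B08Z6WND9D", 3))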

The page object provides a .goto() method to navigate to the desired URL. We give the page plenty of time to render by setting a generous timeout, and add a time.sleep() as an extra layer of assurance that we capture the entire page’s HTML.
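
A fixed sleep either waits longer than necessary or not long enough on slow connections. As an alternative, you can wait for the reviews container itself to appear before reading it; this is a sketch that would replace the time.sleep() call inside main(), assuming the same div.reviews-content selector used in the script:

# wait until the reviews container is present instead of sleeping a fixed time
await page.wait_for_selector('div.reviews-content', timeout=100000)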

Once the HTML is fully loaded, we use the .inner_html() method to grab only the section of the page that contains the review information.

The HTML is saved in the section variable, and the next part of the script uses Beautiful Soup to parse content from it. Let’s take a look.

            # create Beautiful Soup object
            soup = BeautifulSoup(section, 'html.parser')

            # get all items
            reviews = soup.find_all('div', class_='review')

            if reviews:
                for review in reviews:
                    review_text = review.text
                    reviews_split = review_text.split('\n')
                    all_reviews.append(reviews_split[3])
                    print(reviews_split[3])
            else:
                break

            page_num += 1

        with open("reviews.json", "w") as outfile:
            json.dump(all_reviews, outfile)

        await browser.close()

Note that the script above is the continuation of the main function. As mentioned previously, the soup object is created from section, meaning we only parse that part of the overall HTML.

We use the .find_all() function with the review class to grab all the reviews within the section.

Once we have the review elements, we check whether the list is empty. If it is, we break out of the while loop; otherwise, we loop through the elements, grab the content of each review, and store it in the all_reviews list created earlier.

The review class we use to locate the reviews also contains other text that is not covered in this article, such as the reviewer’s name and the rating. This is why we use Python’s built-in .split() to separate the text into lines, where the actual review text sits at index 3. You can also try to target a more precise HTML element instead of relying on this approach.
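
For reference, here is a more targeted sketch that pulls the review body element directly instead of splitting the full text. It assumes Amazon marks the review body with a data-hook="review-body" attribute, which may change over time:

# sketch: extract the review body directly (assumes data-hook="review-body" exists in the markup)
for review in reviews:
    body = review.find('span', attrs={'data-hook': 'review-body'})
    if body:
        all_reviews.append(body.get_text(strip=True))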

After the reviews on the current page have been processed, we increment the page number inside the while loop and the process repeats for the next page. Finally, we save the full list of results to a JSON file and close the browser instance as follows:

        with open("reviews.json", "w") as outfile:
            json.dump(all_reviews, outfile)

        await browser.close()

See below a portion of the JSON output for the target URL:

"Love the fact that you received a pack of 12 covers for any occasion. Have already helped protect my watch from damage. Wonderful product!",
"Love the colors, easy to switch out and protects my watch face.",
"Not only do they protect my watch but I like to match my watch and clothes. These are easy to change out and quite durable.",
"I always have protection on my Apple Watch because I'm clumsy. For the 12 colors it came with it was about the same price two black ones I have been buying. Granite, I'm only gonna probably use four or so but it's still cheaper than buying two black ones of similar quality. I will definitely probably buy this again if it stays this cheap.",
"My daughter loves them"

Outside the function, we call main(1) to obtain a coroutine object and pass it to asyncio.run(); calling the async function directly without an event loop would raise an error:

if __name__ == '__main__':
    coro = main(1)
    asyncio.run(coro)

Full Code

In the previous section, we split the script into several parts for a better understanding of the web scraping process. The full code is below:

from playwright.async_api import async_playwright
import asyncio
import time
import json
from bs4 import BeautifulSoup


async def main(page_num):

    async with async_playwright() as pw:

        # creates an instance of the Chromium browser and launches it
        browser = await pw.chromium.launch(headless=False)

        # creates a new browser page (tab) within the browser instance
        page = await browser.new_page()

        # save all results in this list
        all_reviews = []

        # pagination
        while page_num < 200:

            # URL to crawl
            URL = ("https://www.amazon.com/"
                   "Tempered-Protector-Cuteey-Leopord-Accessories"
                   "/product-reviews/B08Z6WND9D/"
                   "ref=cm_cr_getr_d_paging_btm_2?ie=UTF8&"
                   f"pageNumber={page_num}"
                   "&reviewerType=all_reviews&pageSize=10")

            # go to url with Playwright page element
            await page.goto(URL, timeout=100000)

            time.sleep(2)

            # Extract section with the items
            section = await page.inner_html('div.reviews-content')

            # create Beautiful Soup object
            soup = BeautifulSoup(section, 'html.parser')

            # get all items
            reviews = soup.find_all('div', class_='review')

            if reviews:
                for review in reviews:
                    review_text = review.text
                    reviews_split = review_text.split('\n')
                    all_reviews.append(reviews_split[3])
            else:
                break

            page_num += 1

        with open("reviews.json", "w") as outfile:
            json.dump(all_reviews, outfile)

        await browser.close()

if __name__ == '__main__':
    # start with page 1
    coro = main(1)
    asyncio.run(coro)


Limitations of Scraping Amazon Reviews with Python

The above approach works for small jobs, but it cannot be used for scraping at scale. Amazon has a very sophisticated scraper-detection system: once it identifies a bot, it serves a CAPTCHA screen, which can easily break your data pipeline.
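
If you do run the Playwright version at scale, it is better to fail loudly when a CAPTCHA appears than to silently collect empty results. A minimal guard, placed right after page.goto() inside the while loop, might look like the sketch below; the marker phrase is an assumption about the wording on Amazon’s robot-check page and may vary:

# sketch: stop the run if Amazon serves its robot-check page
# (the marker phrase is an assumption and may differ or change)
html = await page.content()
if 'Enter the characters you see below' in html:
    raise RuntimeError(f'CAPTCHA encountered on page {page_num}, stopping.')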

[Screenshot: Amazon CAPTCHA challenge shown to detected scrapers]

To overcome this, we can use a web scraping API such as Scrapingdog. Scrapingdog can help you scrape millions of such pages without getting blocked, handling IP rotation and retries for you so that you can focus on data collection.

Let’s see how you can scrape reviews from Amazon using Scrapingdog with ease.

Scraping Amazon Reviews with Scrapingdog

Using Scrapingdog is straightforward. To get started, sign up for an account; once you do, you will receive 1,000 free API credits, which is enough for initial testing.

On successful account creation, you will be redirected to your dashboard where you will find your API key.

[Screenshot: Scrapingdog dashboard showing the API key]

Using this API key, you can easily integrate Scrapingdog into your coding environment. For now, we will scrape Amazon reviews with Scrapingdog in our Python environment. You can also refer to the documentation before proceeding with the code.

import requests
import json
from bs4 import BeautifulSoup


def main(page_num):

    # save all results in this list
    all_reviews = []

    # pagination
    while page_num < 200:

        # URL to crawl
        URL = "https://www.amazon.co.uk/TP-LINK-Tapo-Colour-Changeable-Required-L530B/product-reviews/B08JZHXQC4/ref=cm_cr_arp_d_paging_btm_next_{}?ie=UTF8&reviewerType=all_reviews&pageNumber={}".format(page_num,page_num)

        target_url = "https://api.scrapingdog.com/scrape?dynamic=false&api_key=Your-API-key&url={}".format(URL)
        print(URL)
        section = requests.get(target_url)
        # create Beautiful Soup object
        soup = BeautifulSoup(section.text, 'html.parser')

        # get all items
        reviews = soup.find_all('div', class_='review')

        if reviews:
            for review in reviews:
                review_text = review.text
                reviews_split = review_text.split('\n')
                all_reviews.append(reviews_split[3])
        else:
            break

        page_num += 1
        print(all_reviews)
    with open("reviews.json", "w") as outfile:
        json.dump(all_reviews, outfile)


if __name__ == '__main__':
    # start with page 1
    main(1)

As you can see, the code has become much simpler because we have removed Playwright entirely; Scrapingdog handles proxy rotation and retries for us, and can also render JavaScript when a page requires it.
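
As with any HTTP API, it is also good practice to check the response status before parsing. A minimal guard, added right after the requests.get() call inside the while loop, could look like this:

# sketch: stop paginating if the API call did not succeed
if section.status_code != 200:
    print('Request failed with status', section.status_code)
    break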

Conclusion

In this read, we looked at how you can scrape Amazon reviews using Python. A scraper of your own is fine for small tasks, but for commercial purposes you should use an API. This way you save a lot of time and effort, easily bypass Amazon’s CAPTCHAs, and avoid IP bans.

Frequently Asked Questions

What happens if Amazon changes its website structure?

When the website structure changes, we make sure to incorporate that change into the backend of our API, so data extraction stays hassle-free.

Do you offer a free trial?

Of course. We offer 1,000 free API credits so you can try out the API, check the accuracy and the responses, and buy a paid plan if you are satisfied. You can sign up for free from here.


Manthan Koolwal

My name is Manthan Koolwal and I am the founder of scrapingdog.com. I love creating scrapers and seamless data pipelines.