Understanding customer sentiment through product reviews has become a critical factor for success. Among the various platforms, Amazon stands out for its extensive range of products and the reviews that accompany each listing.
Web scraping Amazon reviews can give valuable feedback, enabling detailed sentiment analysis of products within the same category. By extracting and analyzing these insights, businesses and developers can shape their offerings to better meet consumer needs and preferences.
In this blog, we will scrape the reviews of one Amazon product using Python. We have also prepared a guide on scraping Amazon data using Python, and we provide a dedicated API for scraping Amazon product data.
Let’s get Started!
- Better browser support: Playwright supports all modern browsers such as Chrome, Firefox, WebKit, and Edge, whereas Selenium requires additional setup.
- Easy to use: Playwright is more intuitive and available across different programming languages, making it easier for non-Python developers to start using it.
- Better infrastructure: Playwright can be seamlessly integrated with modern web features like WebSockets and does not require a separate driver process.
Read More: Web Scraping with Selenium
You can install Playwright and its browsers with the following commands:
pip install pytest-playwright
playwright install
playwright install-deps
When it comes to the HTML parser, the two most popular solutions are Scrapy and Beautiful Soup. The first is widely used for scaling web scraping infrastructures, but it is harder to implement and requires a good knowledge of the library to operate. The second is easier to navigate and serves as a good alternative for those starting with web scraping. You can install Beautiful Soup with the following pip command:
pip install beautifulsoup4
How to Scrape Amazon Reviews
from playwright.async_api import async_playwright
import asyncio
import time
import json
from bs4 import BeautifulSoup
We import the async_playwright class since we’ll be making asynchronous requests, in which each operation waits for the previous one to complete. We therefore also need the asyncio library to run our function, which will be a coroutine object (async function). The time package is used to wait on specific pages before continuing the script. Finally, we import the json package to save the results as JSON and the Beautiful Soup library to parse HTML.
Let’s take a look at the first part of the script:
async def main(page_num):
    async with async_playwright() as pw:
        # creates an instance of the Chromium browser and launches it
        browser = await pw.chromium.launch(headless=False)
        # creates a new browser page (tab) within the browser instance
        page = await browser.new_page()
        # save all results in this list
        all_reviews = []
        # pagination
        while page_num < 200:
            # URL to crawl
            URL = ("https://www.amazon.com/"
                   "Tempered-Protector-Cuteey-Leopord-Accessories"
                   "/product-reviews/B08Z6WND9D/"
                   "ref=cm_cr_getr_d_paging_btm_2?ie=UTF8&"
                   f"pageNumber={page_num}"
                   "&reviewerType=all_reviews&pageSize=10")
            # go to url with Playwright page element
            await page.goto(URL, timeout=100000)
            time.sleep(2)
            # Extract section with the items
            section = await page.inner_html('div.reviews-content')
We start by defining an asynchronous function by placing async before def. The function takes the page number as input. We instantiate async_playwright() as pw and use it to launch an instance of the Chromium browser. Setting headless to False opens a visible browser window, so we can observe the interactions in real time.
We create a page object to perform actions on the website and a list to store all the scraped reviews. The while loop handles pagination. In this example, we use 200 as the maximum number of pages to scrape; however, as we’ll see later, the loop breaks as soon as there are no more reviews to crawl.
The target URL is the Amazon page containing the reviews of the item we want to tackle. You can specify whatever link you want to scrape but the pageNumber is needed to loop through several pages and ensure all reviews are extracted.
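The URL construction described above can be sketched as a small helper. This is only an illustration: the function name build_reviews_url is hypothetical, and the shortened path without the product-name slug is an assumption rather than something confirmed in this article.

```python
from urllib.parse import urlencode


# Hypothetical helper; the name and the simplified path are assumptions.
def build_reviews_url(asin: str, page_num: int, page_size: int = 10) -> str:
    """Return a review-listing URL for one page of a product's reviews."""
    query = urlencode({
        "ie": "UTF8",
        "pageNumber": page_num,          # the value the while loop increments
        "reviewerType": "all_reviews",
        "pageSize": page_size,
    })
    return f"https://www.amazon.com/product-reviews/{asin}/?{query}"


if __name__ == "__main__":
    print(build_reviews_url("B08Z6WND9D", 2))
```

Keeping the query parameters in a dictionary makes it harder to misplace an & or ? when the page number changes.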
The page object provides a .goto() function to navigate to the desired URL. The timeout argument, given in milliseconds, sets the maximum time to wait for the page to load (100 seconds here). For an extra layer of assurance, we add a time.sleep() of two seconds so the page has time to finish rendering before we capture its HTML.
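One caveat: inside an async function, time.sleep() blocks the whole event loop. With a single page that is harmless, but if you later run several pages concurrently, asyncio.sleep() is the non-blocking equivalent. The sketch below uses hypothetical function names and no browser at all; it only shows that two awaited pauses overlap instead of adding up.

```python
import asyncio
import time


async def pause_and_report(name, seconds):
    # asyncio.sleep suspends only this coroutine; the event loop stays free
    await asyncio.sleep(seconds)
    return name


async def run_both():
    start = time.perf_counter()
    # Both pauses run concurrently, so the total is close to one pause
    results = await asyncio.gather(
        pause_and_report("page-1", 0.3),
        pause_and_report("page-2", 0.3),
    )
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    names, elapsed = asyncio.run(run_both())
    print(names, round(elapsed, 1))
```

Playwright also offers page.wait_for_selector(), which waits for a specific element rather than a fixed delay and may be a better fit when you know which element signals that the reviews have loaded.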
Once the HTML is fully loaded, we use the .inner_html() function to grab the section of the page that contains the reviews.
The HTML code is saved in the variable section and the following script uses Beautiful Soup to parse content from it. Let’s take a look.
            # create Beautiful Soup object
            soup = BeautifulSoup(section, 'html.parser')
            # get all items
            reviews = soup.find_all('div', class_='review')
            if reviews:
                for review in reviews:
                    review_text = review.text
                    reviews_split = review_text.split('\n')
                    all_reviews.append(reviews_split[3])
                    print(reviews_split[3])
            else:
                break
            page_num += 1
        with open("reviews.json", "w") as outfile:
            json.dump(all_reviews, outfile)
        await browser.close()
Note that the script above is the continuation of the main function. As mentioned previously, the soup object is created from section. In other words, we parse only that part of the overall HTML.
We use the .find_all() function with the review class to grab all the reviews within the section.
Once we get the review elements, we check whether the list is empty. If it is empty, we break out of the while loop; otherwise, we loop through the elements, get the content of each review, and save it in the all_reviews list created earlier.
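The stop-on-empty-page pattern can be tried in isolation, without a browser. In the sketch below, collect_all_pages and fake_fetch are hypothetical names, and the fake pages stand in for what the scraper would return.

```python
from typing import Callable, List


def collect_all_pages(fetch_page: Callable[[int], List[str]],
                      start_page: int = 1,
                      max_pages: int = 200) -> List[str]:
    """Collect items page by page until a page comes back empty."""
    collected: List[str] = []
    page_num = start_page
    while page_num < max_pages:
        items = fetch_page(page_num)
        if not items:
            break  # no more reviews to crawl
        collected.extend(items)
        page_num += 1
    return collected


# Stand-in for the real scraper: three pages of fake reviews, then nothing.
FAKE_PAGES = {1: ["great", "ok"], 2: ["bad"], 3: ["fine"]}


def fake_fetch(page_num: int) -> List[str]:
    return FAKE_PAGES.get(page_num, [])


if __name__ == "__main__":
    print(collect_all_pages(fake_fetch))
```

The upper bound of 200 acts only as a safety net; in practice the empty page ends the loop much earlier.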
The elements with the review class also contain other text that is not covered in this article, such as the name of the reviewer and the rating. This is why we use Python’s built-in .split() method to separate the text elements; in this layout, the actual review text sits at index 3 of the resulting list. You can also try to find a more precise HTML class instead of using this approach.
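As a rough illustration of the more precise approach, the sketch below selects the review body by an attribute instead of relying on its position after .split(). The sample markup and the data-hook="review-body" attribute are assumptions made for this example; inspect the live page to confirm which selector applies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page structure may differ.
SAMPLE_HTML = """
<div class="review">
  <span class="a-profile-name">Jane</span>
  <i data-hook="review-star-rating">5.0 out of 5 stars</i>
  <span data-hook="review-body">Love the colors.</span>
</div>
<div class="review">
  <span class="a-profile-name">Sam</span>
  <span data-hook="review-body">
    My daughter loves them
  </span>
</div>
"""


def extract_review_bodies(html):
    """Return the text of each review body found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    bodies = []
    for review in soup.find_all("div", class_="review"):
        # Look up the body by attribute rather than by line position
        body = review.find(attrs={"data-hook": "review-body"})
        if body is not None:
            bodies.append(body.get_text(strip=True))
    return bodies


if __name__ == "__main__":
    print(extract_review_bodies(SAMPLE_HTML))
```

Selecting by attribute keeps working when a review has an extra or missing line of metadata, which would shift the index used with .split().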
Once the for loop finishes, we increment the page number and the while loop repeats the process for the next page. Finally, we save the full list of results in a JSON file and close the browser instance as follows:
        with open("reviews.json", "w") as outfile:
            json.dump(all_reviews, outfile)
        await browser.close()
“Love the fact that you received a pack of 12 covers for any occasion. Have already helped protect my watch from damage. Wonderful product!”, “Love the colors, easy to switch out and protects my watch face.”, “Not only do they protect my watch but I like to match my watch and clothes. These are easy to change out and quite durable.”, “I always have protection on my Apple Watch because I’m clumsy. For the 12 colors it came with it was about the same price two black ones I have been buying. Granite, I’m only gonna probably use four or so but it’s still cheaper than buying two black ones of similar quality. I will definitely probably buy this again if it stays this cheap.”, “My daughter loves them”
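Reviews often contain non-ASCII characters such as typographic quotes or emoji, which json.dump() escapes as \uXXXX sequences by default. If you want the file to stay human-readable, a minimal sketch could look like this (save_reviews and load_reviews are hypothetical helper names):

```python
import json
import os
import tempfile


def save_reviews(reviews, path):
    # ensure_ascii=False keeps quotes, accents and emoji readable in the file
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(reviews, outfile, ensure_ascii=False, indent=2)


def load_reviews(path):
    with open(path, encoding="utf-8") as infile:
        return json.load(infile)


if __name__ == "__main__":
    # Round-trip a small sample through a temporary file
    demo_path = os.path.join(tempfile.mkdtemp(), "reviews.json")
    save_reviews(["“My daughter loves them”"], demo_path)
    print(load_reviews(demo_path))
```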
Outside of the function, we call main() to create a coroutine object and pass it to asyncio.run(). Calling main() on its own only creates the coroutine and never executes the scraper:
if __name__ == '__main__':
    coro = main(1)
    asyncio.run(coro)
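To see the difference between creating a coroutine and running it, here is a small stand-alone sketch. The body of main() is a placeholder, not the scraper from this article.

```python
import asyncio
import inspect


async def main(page_num):
    # Placeholder body standing in for the scraping logic
    await asyncio.sleep(0)
    return f"started at page {page_num}"


coro = main(1)                          # nothing has run yet
is_coroutine = inspect.iscoroutine(coro)
result = asyncio.run(coro)              # runs the coroutine to completion

print(is_coroutine)  # True
print(result)        # started at page 1
```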
Full Code
In the previous section, we split the script into several parts for a better understanding of the web scraping process. The full code can be seen below:
from playwright.async_api import async_playwright
import asyncio
import time
import json
from bs4 import BeautifulSoup


async def main(page_num):
    async with async_playwright() as pw:
        # creates an instance of the Chromium browser and launches it
        browser = await pw.chromium.launch(headless=False)
        # creates a new browser page (tab) within the browser instance
        page = await browser.new_page()
        # save all results in this list
        all_reviews = []
        # pagination
        while page_num < 200:
            # URL to crawl
            URL = ("https://www.amazon.com/"
                   "Tempered-Protector-Cuteey-Leopord-Accessories"
                   "/product-reviews/B08Z6WND9D/"
                   "ref=cm_cr_getr_d_paging_btm_2?ie=UTF8&"
                   f"pageNumber={page_num}"
                   "&reviewerType=all_reviews&pageSize=10")
            # go to url with Playwright page element
            await page.goto(URL, timeout=100000)
            time.sleep(2)
            # Extract section with the items
            section = await page.inner_html('div.reviews-content')
            # create Beautiful Soup object
            soup = BeautifulSoup(section, 'html.parser')
            # get all items
            reviews = soup.find_all('div', class_='review')
            if reviews:
                for review in reviews:
                    review_text = review.text
                    reviews_split = review_text.split('\n')
                    all_reviews.append(reviews_split[3])
            else:
                break
            page_num += 1
        with open("reviews.json", "w") as outfile:
            json.dump(all_reviews, outfile)
        await browser.close()


if __name__ == '__main__':
    # start with page 1
    coro = main(1)
    asyncio.run(coro)
Limitations of scraping Amazon reviews with Python
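Amazon has mechanisms to detect scrapers, so a script like the one above can get blocked once it sends many requests from the same IP. To overcome this, you can use a web scraping API such as Scrapingdog. Scrapingdog can help you scrape millions of such pages without getting blocked. It will handle IP rotations and retries for you so that you can focus on data collection.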
Let’s see how you can scrape reviews from Amazon using Scrapingdog with ease.
Scraping Amazon Reviews with Scrapingdog
Using Scrapingdog is super simple. To get started, you have to sign up for an account. Once you sign up, you will get 1000 free API credits, which is enough for initial testing.
On successful account creation, you will be redirected to your dashboard where you will find your API key.
import requests
import asyncio
import json
from bs4 import BeautifulSoup


async def main(page_num):
    # save all results in this list
    all_reviews = []
    # pagination
    while page_num < 200:
        # URL to crawl
        URL = "https://www.amazon.co.uk/TP-LINK-Tapo-Colour-Changeable-Required-L530B/product-reviews/B08JZHXQC4/ref=cm_cr_arp_d_paging_btm_next_{}?ie=UTF8&reviewerType=all_reviews&pageNumber={}".format(page_num, page_num)
        print(URL)
        # pass the target URL as a parameter so that requests percent-encodes it;
        # otherwise its own "&" separators are read as parameters of the API call
        params = {"dynamic": "false", "api_key": "Your-API-key", "url": URL}
        section = requests.get("https://api.scrapingdog.com/scrape", params=params)
        # create Beautiful Soup object
        soup = BeautifulSoup(section.text, 'html.parser')
        # get all items
        reviews = soup.find_all('div', class_='review')
        if reviews:
            for review in reviews:
                review_text = review.text
                reviews_split = review_text.split('\n')
                all_reviews.append(reviews_split[3])
        else:
            break
        page_num += 1
    print(all_reviews)
    with open("reviews.json", "w") as outfile:
        json.dump(all_reviews, outfile)


if __name__ == '__main__':
    # start with page 1
    coro = main(1)
    asyncio.run(coro)
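Because the Amazon URL carries its own query string, it needs to be percent-encoded when it is passed as the url parameter; otherwise its & separators are read as parameters of the API call and the page number is lost. The sketch below builds the request URL with the standard library only. The endpoint and parameter names are the ones used in this article; build_api_url is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API_ENDPOINT = "https://api.scrapingdog.com/scrape"  # endpoint as used in this article


def build_api_url(api_key, target_url, dynamic=False):
    """Return the API request URL with the target URL safely encoded."""
    query = urlencode({
        "api_key": api_key,
        "url": target_url,  # urlencode escapes the "?", "&" and "=" inside it
        "dynamic": "true" if dynamic else "false",
    })
    return f"{API_ENDPOINT}?{query}"


if __name__ == "__main__":
    target = ("https://www.amazon.co.uk/product-reviews/B08JZHXQC4/"
              "?ie=UTF8&reviewerType=all_reviews&pageNumber=2")
    api_url = build_api_url("Your-API-key", target)
    # Decoding the query again gives back the untouched target URL
    print(parse_qs(urlsplit(api_url).query)["url"][0] == target)
```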
In this article, we have looked at how you can scrape Amazon reviews using Python. A scraper of your own works well for small tasks, but for commercial purposes you should use an API. This way you save a lot of time and effort, bypass Amazon captchas more easily, and avoid IP bans.
Frequently Asked Questions
Yes, Amazon has a mechanism to detect scrapers. To avoid getting blocked, you can use an API like Scrapingdog.
When the website structure changes, we make sure that we incorporate that structure change in the backend of our API for hassle-free data extraction.
Of course, we offer 1000 free API credits so you can try our API. This way you can check the accuracy and response, and buy a paid plan if you are satisfied with it. You can sign up for free from here.