< Back to Blog Overview

Web Scraping with NodeJS: A Practical Guide

22-12-2020

Web scraping has become very crucial when it comes to data collection. No matter what kind of data you need you must have the skills to collect data from any target website. There are endless use cases of web scraping which makes it a “must-have” skill. You can use it to collect travel data or maybe product pricing from e-commerce websites like Amazon, etc.

web scraping with nodejs full tutorial
Web Scraping with Nodejs

We have written many articles on web scraping and have scraped various different websites with Python but in this article, we will show you how it can be done with Nodejs.

This tutorial will be divided into sections. In the first section we will scrape a website with a normal XHR request and then in the next section we will scrape another website that only loads data after Javascript execution.

Setting up the prerequisites

I am assuming that you have already installed nodejs on your machine. If not then you can do so from here.

For this tutorial, we will mainly need three III-part nodejs libraries.

  1. Unirest– For making XHR requests to the website we are going to scrape.
  2. Cheerio– To parse the data we got by making the request through unirest.
  3. Puppeteer– It is a nodejs library that will be used to control and automate headless Chrome or Chromium browsers.

Before we install these libraries we will have to create a dedicated folder for our project.

mkdir nodejs_tutorial
npm init

npm init will initialize a new Node.js Project. This command will create a package.json() file. Now, let’s install all of these libraries.

npm i unirest cheerio puppeteer

This step will install all the libraries in your project. Create a .js file inside this folder by any name you like. I am using first_section.js.

Now we are all set to create a web scraper. So, for the first section, we are going to scrape this page from books.toscrape.com.

Downloading raw data from books.toscrape.com

Our first job would be to make a GET request to our host website using unirest. Let’s test our setup first by making a GET request. If we get a response code of 200 then we can proceed ahead with parsing the data.

//first_section.js

const unirest = require('unirest');
const cheerio = require('cheerio');

async function scraper(){
  let target_url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
  let data = await unirest.get(target_url)
  
  return {status:data.status}
}

scraper().then((data) => {
  console.log(data)
}).catch((err) => {
  console.log(err)
})

Let me first explain this code to you step by step before running it.

Importing required modules:

  • The unirest module is imported to make HTTP requests and retrieve the HTML content of a web page.
  • The cheerio module is imported to parse and manipulate the HTML content using jQuery-like syntax.

Defining the scraper async function:

  • The scraper function is an async function, which means it can use the await keyword to pause execution and wait for promises to resolve.
  • Inside the function, a target URL (https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html) is assigned to the target_url variable.

Making an HTTP GET request and loading the HTML content:

  • The unirest.get() method is used to make an HTTP GET request to the target_url.
  • The await keyword is used to wait for the request to complete and retrieve the response object.
  • The HTML content of the response is accessed through data.body.

Returning the result:

The scraper function returns an object with the status code of the HTTP response (data.status).

Invoking the scraper function:

  • The scraper function is called asynchronously using scraper().then() syntax.
  • The resolved data from the function is logged into the console.
  • Any errors that occur during execution are caught and logged into the console.

I hope you have got an idea of how this code is actually working. Now, let’s run this and see what status we get. You can run the code using the below command.

node first_section.js 

When I run I get a 200 code.

{ status: 200 }

It means my code is ready and I can proceed ahead with the parsing process.

What are we going to scrape?

It is always great to decide in advance what are you going to extract from the target page. This way we can analyze in advance which element is placed where inside the DOM.

We are going to scrape five data elements from this page:

  1. Product Title
  2. Product Price
  3. In stock with quantity
  4. Product rating
  5. Product image.

We will start by making a GET request to this website with our HTTP agent unirest and once the data has been downloaded from the website we can use Cheerio to parse the required data.

With the help of .find() function of Cheerio, we are going to find each data and extract its text value.

Before making the request we are going to analyze the page and find the location of each element inside the DOM. One should always do this exercise to identify the location of each element.

We are going to do this by simply using the developer tool. This can be accessed by simply right-clicking on the target element and then clicking on the inspect. This is the most common method, you might already know this.

Identifying the location of each element

Let’s start by searching for the product title.

The title is stored inside h1 tag. So, this can be easily extracted using Cheerio.

const $ = cheerio.load(data.body);
obj["title"]=$('h1').text()
  • The cheerio.load() function is called, passing in data.body as the HTML content to be loaded.
  • This creates a Cheerio instance, conventionally assigned to the variable $, which represents the loaded HTML and allows us to query and manipulate it using familiar jQuery-like syntax.
  • HTML structure has an <h1> element, the code uses $('h1') to select all <h1> elements in the loaded HTML. But in our case, there is only one.
  • .text() is then called on the selected elements to extract the text content of the first <h1> element found.
  • The extracted title is assigned to the obj["title"] property.

Find the price of the product.

This price data can be found inside the p tag with class name price_color.

obj["price_of_product"]=$('p.price_color').text().trim()

Find the stock data

Stock data can be found inside p tag with the class name instock.

obj["stock_data"]=$('p.instock').text().trim()

Finding the star rating

Here star rating is the name of the class. So, we will first find this class by the name star-rating and then we will find the value of this class attribute using .attr() function provided by the cheerio.

obj["rating"]=$('p.star-rating').attr('class').split(" ")[1]

Finding the image

The image is stored inside an img tag which is located inside the div tag with id product_gallery.

obj["image"]="https://books.toscrape.com"+$('div#product_gallery').find('img').attr('src').replace("../..","")

By adding “https://books.toscrape.com” as a pretext we are completing the URL.

With this, we have managed to find every data we were planning to extract.

Complete Code

You can extract many other things from the page but my main motive was to show you how the combination of any HTTP agent(Unirest, Axios, etc) and Cheerio can make web scraping super simple.

The code will look like this.

const unirest = require('unirest');
const cheerio = require('cheerio');

async function scraper(){
  let obj={}
  let arr=[]
  let target_url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
  let data = await unirest.get(target_url)
  const $ = cheerio.load(data.body);
  obj["title"]=$('h1').text()
  obj["price_of_product"]=$('p.price_color').text().trim()
  obj["stock_data"]=$('p.instock').text().trim()
  obj["rating"]=$('p.star-rating').attr('class').split(" ")[1]
  obj["image"]="https://books.toscrape.com"+$('div#product_gallery').find('img').attr('src').replace("../..","")
  arr.push(obj)
  return {status:data.status,data:arr}
}

scraper().then((data) => {
  console.log(data.data)
}).catch((err) => {
  console.log(err)
})

II Section- Web Scraping Dynamic Websites with Puppeteer

In this section, we are going to scrape Facebook which loads its data through javascript execution. We are going to scrape this page using puppeteer. The page looks like this.

As usual, we should test our setup before starting with the scraping and parsing process.

Downloading the raw data from Facebook

We will write a code that will open the browser and then open the Facebook page which we want to scrape. Then it will close the browser once the page is loaded completely.

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scraper(){
  let obj={}
  let arr=[]
  let target_url = 'https://www.facebook.com/nyrestaurantcatskill/'
  const browser = await puppeteer.launch({headless:false});
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 800 });
  let crop = await page.goto(target_url, {waitUntil: 'domcontentloaded'});



  let data = await page.content();
  await page.waitFor(2000)
  await browser.close();

  return {status:crop.status(),data:data}
}

scraper().then((data) => {
  console.log(data.data)
}).catch((err) => {
  console.log(err)
})

This code snippet demonstrates an asynchronous function named scraper that uses Puppeteer for automating web browsers, to scrape data from a specific Facebook page.

Let’s break down the code step by step:

  1. The function scraper is declared as an asynchronous function. It means that it can use the await keyword to wait for asynchronous operations to complete.
  2. Two variables, obj and arr, are initialized as empty objects and arrays, respectively. These variables are not used in the provided code snippet.
  3. The target_url variable holds the URL of the Facebook page you want to scrape. In this case, it is set to 'https://www.facebook.com/nyrestaurantcatskill/'.
  4. puppeteer.launch({headless:false}) launches a Puppeteer browser instance with the headless option set to false. This means that the browser will have a visible UI when it opens. If you set headless to true, the browser will run in the background without a visible interface.
  5. browser.newPage() creates a new browser tab (page) and assigns it to the page variable.
  6. page.setViewport({ width: 1280, height: 800 }) sets the viewport size of the page to 1280 pixels width and 800 pixels height. This simulates the screen size for the scraping operation.
  7. page.goto(target_url, {waitUntil: 'domcontentloaded'}) navigates the page to the specified target_url. The {waitUntil: 'domcontentloaded'} option makes the function wait until the DOM content of the page is fully loaded before proceeding.
  8. The crop variable stores the result of the page.goto operation, which is a response object containing information about the page load status.
  9. page.content() retrieves the HTML content of the page as a string and assigns it to the data variable.
  10. page.waitFor(2000) pauses the execution of the script for 2000 milliseconds (2 seconds) before proceeding. This can be useful to wait for dynamic content or animations to load on the page.
  11. browser.close() closes the Puppeteer browser instance.
  12. The function returns an object with two properties: status and data. The status property contains the status code of the crop response object, indicating whether the page load was successful or not. The data property holds the HTML content of the page.

Once you run this code you should see this.

If you see this then your setup is ready and we can proceed ahead with data parsing using Cheerio.

What are we going to scrape?

We are going to scrape these five things from the page.

  • Address
  • Phone number
  • Email address
  • Website
  • Rating

Now as usual we are going to first analyze their location inside the DOM. We will take support of Chrome dev tools for this. Then using Cheerio we are going to parse each of them.

Identifying the location of each element

Let’s start with the address first and find out its location.

Once you inspect you will find that all the information we want to scrape is stored inside this div tag with the class name xieb3on. And then inside this div tag, we have two more div tags out of which we are interested in the second one because the information is inside that.

Let’s find this first.

$('div.xieb3on').first().find('div.x1swvt13').each((i,el) => {
  if(i===1){

  }    
})

We have set a condition that only if i is 1 then only it will go inside the condition. With this, we have cleared our intention that we are only interested in the second div block. Now, the question is how to extract an address from this. Well, it become very easy now.

The address can be found inside the first div tag with the class x1heor9g. This div tag is inside the ul tag.

$('div.xieb3on').first().find('div.x1swvt13').each((i,el) => {
  if(i===1){
    obj["address"] = $(el).find('ul').find('div.x1heor9g').first().text().trim()
  }    
})

Let’s find the email, website, phone number, and the rating

All of these four elements are hidden inside div tags with classes xu06os2. All these four div tags are also inside the same ul tag as address.

$(el).find('ul').find('div.xu06os2').each((o,p) => {
        let value =  $(p).text().trim()
        if(value.includes("+")){
          obj["phone"]=value
        }else if(value.includes("Rating")){
          obj["rating"]=value
        }else if(value.includes("@")){
          obj["email"]=value
        }else if(value.includes(".com")){
          obj["website"]=value
        }      

})
arr.push(obj)
obj={}
  1. .find('div.xu06os2') is used to find all <div> elements with the class xu06os2 that are descendants of the previously selected <ul> elements.
  2. .each((o,p) => { ... }) iterates over each of the matched <div> elements, executing the provided callback function for each element.
  3. let value = $(p).text().trim() extracts the text content of the current <div> element (p) and trims any leading or trailing whitespace.
  4. The subsequent if conditions check the extracted value for specific patterns using the .includes() method:

a. If the value includes the “+” character, it is assumed to be a phone number, and it is assigned to the obj["phone"] property.

b. If the value includes the word “Rating”, it is assumed to be a rating value, and it is assigned to the obj["rating"] property.

c. If the value includes the “@” character, it is assumed to be an email address, and it is assigned to the obj["email"] property.

d. If the value includes the “.com” substring, it is assumed to be a website URL, and it is assigned to the obj["website"] property.

5. arr.push(obj) appends the current obj object to the arr array.

6. obj={} reassigns an empty object to the obj variable, resetting it for the next iteration.

Complete Code

Let’s see what the complete code looks like along with the response after running the code.

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scraper(){
  let obj={}
  let arr=[]
  let target_url = 'https://www.facebook.com/nyrestaurantcatskill/'
  const browser = await puppeteer.launch({headless:false});
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 800 });
  let crop = await page.goto(target_url, {waitUntil: 'domcontentloaded'});



  await page.waitFor(5000)
  let data = await page.content();
  await browser.close();
  const $ = cheerio.load(data)


  $('div.xieb3on').first().find('div.x1swvt13').each((i,el) => {

    if(i===1){
      obj["address"] = $(el).find('ul').find('div.x1heor9g').first().text().trim()
      $(el).find('ul').find('div.xu06os2').each((o,p) => {
        let value =  $(p).text().trim()
        if(value.includes("+")){
          obj["phone"]=value
        }else if(value.includes("Rating")){
          obj["rating"]=value
        }else if(value.includes("@")){
          obj["email"]=value
        }else if(value.includes(".com")){
          obj["website"]=value
        }      

      })

      arr.push(obj)
      obj={}
    }
  })

  return {status:crop.status(),data:arr}
}

scraper().then((data) => {
  console.log(data.data)
}).catch((err) => {
  console.log(err)
})

Let’s run it and see the response.

Well, finally we managed to scrape a dynamic website using JS rendering.

Scraping without getting blocked with Scrapingdog

Scrapingdog is a web scraping API that uses new proxy/IP on every new request. Once you start scraping Facebook at scale you will face two challenges.

  • Puppeteer will consume too much CPU. Your machine will get super slow.
  • Your IP will be banned in no time.

With Scrapingdog you can resolve both the issues very easily. It uses headless Chrome browsers to render websites and every request goes through a new IP.

const unirest = require('unirest');
const cheerio = require('cheerio');

async function scraper(){
  let obj={}
  let arr=[]
  let target_url = 'https://api.scrapingdog.com/scrape?api_key=Your-API-Key&url=https://www.facebook.com/nyrestaurantcatskill/'
  let data = await unirest.get(target_ul)
  const $ = cheerio.load(data.body)


  $('div.xieb3on').first().find('div.x1swvt13').each((i,el) => {

    if(i===1){
      obj["address"] = $(el).find('ul').find('div.x1heor9g').first().text().trim()
      $(el).find('ul').find('div.xu06os2').each((o,p) => {
        let value =  $(p).text().trim()
        if(value.includes("+")){
          obj["phone"]=value
        }else if(value.includes("Rating")){
          obj["rating"]=value
        }else if(value.includes("@")){
          obj["email"]=value
        }else if(value.includes(".com")){
          obj["website"]=value
        }

      })

      arr.push(obj)
      obj={}
    }
  })

  return {status:crop.status(),data:arr}
}

scraper().then((data) => {
  console.log(data.data)
}).catch((err) => {
  console.log(err)
})

As you can see you just have to make a simple GET request and your job is done. Scrapingdog will handle everything from headless chrome to retries for you.

If interested you can start with free 1000 credits to try how Scrapingdog can help you collect a tremendous amount of data.

Conclusion

Nodejs is a very powerful language for web scraping and this tutorial is evidence. Of course, you have deep dive a little to better some basics but it is fast and robust.

In this tutorial, we learned how you can scrape both dynamic and non-dynamic websites with nodejs. In the case of dynamic websites, you can also use playwright in place of the puppeteer.

I just wanted to give you an idea of how nodejs can be used for web scraping and I hope with this tutorial you will get a little clarity.

Additional Resources

And there’s the list! At this point, you should feel comfortable writing your first web scraper to gather data from any website. Here are a few additional resources that you may find helpful during your web scraping journey:

Manthan Koolwal

My name is Manthan Koolwal and I am the founder of scrapingdog.com. I love creating scraper and seamless data pipelines.
Scrapingdog Logo

Try Scrapingdog for Free!

Free 1000 API calls of testing.

No credit card required!

DMCA.com Protection Status