
Web Scraping 101 with Node.js

2020-12-22

In this post, we are going to talk about the tools and libraries available in the Node.js ecosystem for web scraping. We will start with some easy and basic libraries and then move on to more advanced tools. We will briefly cover the pros and cons of each tool and try to highlight the small details that can help us while scraping.


Libraries

  1. Simplecrawler
  2. Cheerio
  3. Puppeteer
  4. Playwright
  5. HTTP Clients — Axios, Unirest & Superagent
  6. Nightmare

Along the way, we will also cover the most important things to keep in mind during data extraction.

Simplecrawler

Example

npm install --save simplecrawler

I have created a scraper.js file in my project folder. Inside that file, write:

var Crawler = require("simplecrawler");
var crawler = new Crawler("https://books.toscrape.com/");

We supply the constructor with a URL that indicates which domain to crawl and which resource to fetch first. You can configure 3 things before scraping the website.

a) Request Interval

crawler.interval = 10000; // Ten seconds

b) Concurrency of requests

crawler.maxConcurrency = 3;

c) Number of links to fetch

crawler.maxDepth = 1; // Only the first page is fetched (with linked CSS & images)
// Or:
crawler.maxDepth = 2; // First page and discovered links from it are fetched

This library also provides many more configuration properties, which can be found in its documentation.

You’ll also need to set up event listeners for the events you want to listen to. crawler.fetchcomplete and  crawler.complete are good places to start.

crawler.on("fetchcomplete", function(queueItem, responseBuffer, response) {
  console.log("I just received %s (%d bytes)", queueItem.url, responseBuffer.length);
  console.log("It was a resource of type %s", response.headers['content-type']);
});
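The complete event mentioned above fires once the queue is exhausted. A minimal sketch of a listener for it (not part of the original snippet) could look like this:

crawler.on("complete", function() {
  // All queued resources on the domain have been fetched.
  console.log("Crawl finished!");
});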

Then, when you’re satisfied and ready to go, start the crawler! It’ll run through its queue finding linked resources on the domain to download until it can’t find any more.

crawler.start();

Pros

  1. Lots of customization properties available.
  2. Easy setup using event listeners.

Cons

  1. Limited error handling.
  2. It will try to fetch invalid URLs too, due to its brute-force approach (a filtering sketch follows below).
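One way to soften the second point is simplecrawler's addFetchCondition, which lets you veto URLs before they are fetched. A minimal sketch, where the file-extension filter is just an illustrative assumption:

// Skip anything that looks like an image, stylesheet, or script.
crawler.addFetchCondition(function(queueItem) {
  return !queueItem.path.match(/\.(png|jpg|gif|css|js)$/i);
});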

Cheerio

Example

First, you have to install cheerio

npm install cheerio

Then type the following code to extract the desired text.

const cheerio = require('cheerio');
const axios = require('axios');

async function main() {
  var scraped_data = await axios.get("https://books.toscrape.com/");
  const $ = cheerio.load(scraped_data.data);
  var name = $(".page_inner").first().find("a").text();
  console.log(name);
  //Books to Scrape
}
main();

First, we make an HTTP request to the website with Axios and store the response in scraped_data. We then load the HTML into Cheerio and use the class name to select the data we want.
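Cheerio really shines when you need to pull many items at once. Here is a minimal sketch, assuming the books.toscrape.com listing keeps each book inside an article.product_pod element with the title on its link:

// Collect the title of every book listed on the first page.
var titles = [];
$("article.product_pod h3 a").each(function() {
  titles.push($(this).attr("title"));
});
console.log(titles);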

Pros

  1. Ready-made, jQuery-like methods are available for parsing.
  2. The API is very fast.

Cons

  1. It cannot execute JavaScript, so pages that build their content on the client side need a headless browser such as Puppeteer or Playwright instead.

Puppeteer

Example

npm i puppeteer --save

Then in your scraper.js file write the following code.

const puppeteer = require('puppeteer');

async function main() {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  var results = await page.goto('https://books.toscrape.com/');
  await page.waitFor(1000);
  await browser.close();
  console.log(results);
}
main();

Awesome! Let's break it down line by line:

First, we create our browser and set headless mode to false. This allows us to watch exactly what is going on:

const browser = await puppeteer.launch({headless: false});

Then, we create a new page in our browser:

const page = await browser.newPage();

Next, we go to the books.toscrape.com URL:

var results = await page.goto('https://books.toscrape.com/');

I’ve added a delay of 1000 milliseconds. While normally not necessary, this will ensure everything on the page loads:

await page.waitFor(1000);

Finally, after everything is done, we’ll close the browser and print our result.

await browser.close();
console.log(results);

The setup is complete. Data is ready!
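Note that page.goto resolves to the HTTP response, not the page markup itself. If you want the rendered HTML, a minimal sketch (not part of the original snippet) is to grab it with page.content() before closing the browser and hand it to Cheerio, just like in the earlier example:

const cheerio = require('cheerio'); // same parser as in the Cheerio section

// Inside the async function, before await browser.close():
const html = await page.content(); // full rendered HTML of the tab
const $ = cheerio.load(html);
console.log($(".page_inner").first().find("a").text()); // Books to Scrape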

Pros

  • Puppeteer removes the dependency on an external driver to run the tests.

Cons

  • Puppeteer is limited to Chromium-based browsers for now, until its Firefox support is complete.
  • Puppeteer has a smaller testing community at the moment; there is more test-specific support for Selenium.

Playwright

Example

Now we’ll install Playwright.

npm i playwright

Building a scraper

In your project folder, create a file called scraper.js (or choose any other name) and open it in your favorite code editor. First, we will confirm that Playwright is correctly installed and working by running a simple script.

// Import the playwright library into our scraper.
const playwright = require('playwright');

async function main() {
  // Open a Chromium browser. We use headless: false
  // to be able to watch what's going on.
  const browser = await playwright.chromium.launch({
    headless: false
  });
  // Open a new page / tab in the browser.
  const page = await browser.newPage();
  // Tell the tab to navigate to the book's product page.
  var results = await page.goto('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html');
  // Pause for 10 seconds, to see what's going on.
  await page.waitForTimeout(10000);
  // Turn off the browser to clean up after ourselves.
  await browser.close();
}

main();

If you saw a Chromium window open and the book page load successfully, congratulations: you just robotized your web browser with Playwright!


The results variable holds the navigation response. To extract the information, grab the rendered HTML and hand it to Cheerio, as shown below.
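A minimal sketch, assuming the product page keeps the book title in its h1 element; these lines would go inside main(), before await browser.close():

const cheerio = require('cheerio');

const html = await page.content(); // rendered HTML of the product page
const $ = cheerio.load(html);
console.log($('h1').first().text()); // A Light in the Attic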

Clicking buttons is extremely easy with Playwright. By prefixing text= to a string you’re looking for, Playwright will find the element that includes this string and click it. It will also wait for the element to appear if it’s not rendered on the page yet. This is a huge advantage over Puppeteer. Once you have clicked, you have to wait for the page to load and then use Cheerio to get the information you are looking for. But for now, we are not going in that direction.
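As an illustrative sketch of that text= selector, the link text "Books" is just an assumption about the sidebar on books.toscrape.com; the lines would go inside main():

// Click the "Books" link and wait for the navigation to settle.
await page.click('text=Books');
await page.waitForLoadState('networkidle');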

Pros

  1. Cross-browser support.
  2. Documentation is great

Cons

HTTP Clients

Axios

var axios = require('axios');

async function main() {
  try {
    var scraped_data = await axios.get("https://books.toscrape.com/");
    console.log(scraped_data.data);
    //<!DOCTYPE html>......//
  } catch (err) {
    console.log(err);
  }
}
main();

You can install Axios through the following command

npm i axios --save

Pros

  1. Supports promises.
  2. Error handling is great (see the sketch below).
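A minimal sketch of what that error handling looks like in practice; the missing page here is just an illustrative URL:

const axios = require('axios');

async function fetchPage(url) {
  try {
    const response = await axios.get(url);
    return response.data;
  } catch (err) {
    if (err.response) {
      // The server answered, but with an error status (e.g. 404).
      console.log('Server responded with status', err.response.status);
    } else if (err.request) {
      // The request was sent but no response ever came back.
      console.log('No response received:', err.message);
    } else {
      console.log('Request setup failed:', err.message);
    }
    return null;
  }
}

fetchPage('https://books.toscrape.com/this-page-does-not-exist.html');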

Unirest

var unirest = require('unirest');

async function main() {
  try {
    var scraped_data = await unirest.get("https://books.toscrape.com/");
    console.log(scraped_data.body);
    //<!DOCTYPE html>......//
  } catch (err) {
    console.log(err);
  }
}
main();

You can install Unirest through the following command

npm i unirest --save

Pros

  1. File transfer is simple.
  2. You can send the request directly by providing a callback along with the URL.

Superagent

const superagent = require('superagent');

async function main() {
  try {
    var scraped_data = await superagent.get("https://books.toscrape.com/");
    console.log(scraped_data.text);
    //<!DOCTYPE html>......//
  } catch (err) {
    console.log(err);
  }
}
main();

You can install Superagent through the following command

npm i superagent --save

Pros

  • Numerous plugins available for many common features
  • Works in both Browser and Node

Cons

Nightmare

Nightmare is an ideal choice over Puppeteer if you don’t like the heavy bundle the latter comes with. Your scraper will bypass a lot of the annoying code that can trip it up. Not only that, it means websites that render mostly on the client side are still scrapeable — if you’ve ever been thrown by needing to make an AJAX request to return a form in your scraping, today is your day to be awesome!

You can install nightmare library by running the following command:

npm install nightmare

Once Nightmare is installed we will find Scrapingdog’s website link through the Duckduckgo search engine.

const Nightmare = require('nightmare');
const nightmare = Nightmare();

nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'Scrapingdog')
  .click('#search_button_homepage')
  .wait('#links .result__a')
  .evaluate(() => document.querySelector('#links .result__a').href)
  .end()
  .then((link) => {
    console.log('Scrapingdog Web Link:', link);
    //Scrapingdog Web Link: https://www.scrapingdog.com/
  })
  .catch((error) => {
    console.error('Search failed:', error);
  });

Now, we’ll go line by line. First, we create an instance of Nightmare. Then we open the DuckDuckGo search engine using .goto. Next, we select the search bar by its selector and set its value to "Scrapingdog" using .type. Once all this is done, we submit the search. Nightmare waits until the first result link has loaded and then uses the DOM to read the value of its href attribute. After receiving the link, it prints it to the console. A variant that collects several result links is sketched below.
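A minimal sketch of that variant, reusing the same selectors; the .slice(0, 5) limit is just an illustrative choice:

const Nightmare = require('nightmare');

Nightmare()
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'Scrapingdog')
  .click('#search_button_homepage')
  .wait('#links .result__a')
  // Return the href of every result link instead of only the first one.
  .evaluate(() => Array.from(document.querySelectorAll('#links .result__a')).map((a) => a.href))
  .end()
  .then((links) => console.log(links.slice(0, 5)))
  .catch((error) => console.error('Search failed:', error));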

Pros

  1. Wait feature to pause the browser.
  2. Lighter than Puppeteer.

Cons

Summary

Feel free to comment and ask me anything. You can follow me on Twitter and Medium. Thanks for reading and please hit the like button! 👍
