Scrapingdog

Web Scraping 101 with Node.js

22-12-2020

In this post, we are going to talk about the tools and libraries offered by Node.js for web scraping. We will start with some easy, basic libraries and then move on to advanced tools, briefly covering the pros and cons of each and highlighting the small details that can help while scraping.


Libraries

  1. Simplecrawler
  2. Cheerio
  3. Puppeteer
  4. Playwright
  5. HTTP Clients: Axios, Unirest & Superagent
  6. Nightmare

We will also cover the most important things to keep in mind during data extraction.

Simplecrawler

Simplecrawler is designed to provide a basic, flexible, and robust API for crawling websites. It was written to archive, analyze, and search some very large websites and has happily chewed through hundreds of thousands of pages and written tens of gigabytes to disk without issue. It has a flexible queue system that can be frozen to disk and defrosted.

Example

To understand Simplecrawler we are going to scrape https://books.toscrape.com/. I am assuming that you have Node.js installed, along with a working directory where we will save our script. So, the first thing is to install Simplecrawler.

npm install --save simplecrawler

I have created a scraper.js file in my folder. Inside that file, write:

var Crawler = require("simplecrawler");
var crawler = new Crawler("https://books.toscrape.com/");

We supply the constructor with a URL that indicates which domain to crawl and which resource to fetch first. You can configure three things before scraping the website.

a) Request Interval

crawler.interval = 10000; // Ten seconds

b) Concurrency of requests

crawler.maxConcurrency = 3;

c) Number of links to fetch

crawler.maxDepth = 1; // Only first page is fetched (with linked CSS & images)

// Or:

crawler.maxDepth = 2; // First page and discovered links from it are fetched

This library also provides more properties, which can be found in its documentation.

You’ll also need to set up event listeners for the events you want to listen to. The fetchcomplete and complete events are good places to start.

crawler.on("fetchcomplete", function(queueItem, responseBuffer, response) {
  console.log("I just received %s (%d bytes)", queueItem.url, responseBuffer.length);
  console.log("It was a resource of type %s", response.headers['content-type']);
});

Then, when you’re satisfied and ready to go, start the crawler! It’ll run through its queue finding linked resources on the domain to download until it can’t find any more.

crawler.start();

Pros

  1. Adjusts headers and respects robots.txt.
  2. Lots of customization properties available.
  3. Easy setup using event listeners.

Cons

  1. The biggest disadvantage is that it does not support Promises.
  2. Error handling has to be wired up manually through events.
  3. It will try to fetch invalid URLs too, due to its brute-force approach.
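The third con can be mitigated: simplecrawler lets you register fetch conditions that veto queue items before they are fetched (via its addFetchCondition hook). The filter itself is plain JavaScript. Here is a hedged sketch using Node's built-in URL parser; the function name and the queueItem shape shown are illustrative:

```javascript
// Illustrative fetch-condition filter: only follow http(s) links on
// the target host, skipping obvious binary assets and invalid URLs.
// The queueItem shape (a .url property) mirrors simplecrawler's.
function sameSiteHtmlOnly(queueItem) {
  let parsed;
  try {
    parsed = new URL(queueItem.url); // throws on malformed URLs
  } catch (err) {
    return false; // veto invalid URLs instead of brute-forcing them
  }
  const isHttp = parsed.protocol === 'http:' || parsed.protocol === 'https:';
  const isBinary = /\.(png|jpe?g|gif|pdf|zip)$/i.test(parsed.pathname);
  return isHttp && parsed.hostname === 'books.toscrape.com' && !isBinary;
}

// Registering it would look roughly like this (not run here):
// crawler.addFetchCondition(sameSiteHtmlOnly);
```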

Cheerio

Cheerio is a library that is used to parse HTML and XML documents. You can use jQuery syntax with the downloaded data. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API. You can filter out the data you want using selectors. Cheerio works with a very simple, consistent DOM model. As a result, parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that Cheerio is about 8x faster than JSDOM.

Example

We will scrape the header line from https://books.toscrape.com/, which says “Books to Scrape”.

First, you have to install cheerio

npm install cheerio

Then type the following code to extract the desired text.

const cheerio = require('cheerio');
const axios = require('axios');

async function main() {
  const scraped_data = await axios.get("https://books.toscrape.com/");
  const $ = cheerio.load(scraped_data.data);
  const name = $(".page_inner").first().find("a").text();
  console.log(name);
}

main();

//Books to Scrape

First, we make an HTTP request to the website and store the response in scraped_data. We then load the HTML into Cheerio and use the class name to extract the data.
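One detail worth knowing before you extract links this way: the href attributes you scrape are often relative, and you will want absolute URLs before requesting them. Node's built-in WHATWG URL class handles the resolution; this small stdlib sketch needs no Cheerio (the helper name is illustrative):

```javascript
// Resolve hrefs scraped from a page against the page's own URL,
// the same way a browser would.
const base = 'https://books.toscrape.com/index.html';

function absolutize(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

console.log(absolutize('catalogue/page-2.html', base));
// https://books.toscrape.com/catalogue/page-2.html
```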

Pros

  1. Data parsing & extraction becomes very easy.
  2. Already configured methods are available.
  3. API is fast.

Cons

  1. It cannot execute JavaScript, so it only sees the static HTML the server returns.

Puppeteer

Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default but can be configured to run full (non-headless) Chrome or Chromium, which lets you watch the execution live. It removes the dependency on any external driver to run the operation and provides fine-grained control over Chrome.

Example

We are going to scrape https://books.toscrape.com/ again. First, install the puppeteer library.

npm i puppeteer --save

Then in your scraper.js file write the following code.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  const results = await page.goto('https://books.toscrape.com/');
  await page.waitForTimeout(1000);
  await browser.close();
  console.log(results);
})();

Awesome! Let’s break it down line by line:

First, we create our browser and set headless mode to false. This allows us to watch exactly what is going on:

const browser = await puppeteer.launch({headless: false});

Then, we create a new page in our browser:

const page = await browser.newPage();

Next, we go to the books.toscrape.com URL:

const results = await page.goto('https://books.toscrape.com/');

I’ve added a delay of 1000 milliseconds. While normally not necessary, this will ensure everything on the page loads:

await page.waitForTimeout(1000);

Finally, after everything is done, we’ll close the browser and print our result.

await browser.close();
console.log(results);

The setup is complete. One note: page.goto resolves to the HTTP response object, not the page HTML; call await page.content() if you need the rendered markup itself.

Pros

  • Puppeteer allows access to the measurement of loading and rendering times provided by the Chrome Performance Analysis tool.
  • Puppeteer removes the dependency on an external driver to run the tests.

Cons

  • Puppeteer is limited to the Chrome/Chromium browser for now, until Firefox support is completed.
  • Puppeteer has a smaller testing community; there is more test-specific support for Selenium.

Playwright

Playwright is a Node.js library to automate Chromium, Firefox, and WebKit with a single API very similar to Puppeteer. Playwright is built to enable cross-browser web automation that is evergreen, capable, reliable, and fast, covering everything from automating tasks and testing web applications to data mining.

Example

We will build a simple scraper to demonstrate the application of Playwright. We will scrape the first book from https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html.

Now we’ll install Playwright.

npm i playwright

Building a scraper

Creating a scraper with Playwright is surprisingly easy, even if you have no previous scraping experience. If you understand JavaScript and CSS, it will be a piece of cake.

In your project folder, create a file called scraper.js (or choose any other name) and open it in your favorite code editor. First, we will confirm that Playwright is correctly installed and working by running a simple script.

// Import the playwright library into our scraper.
const playwright = require('playwright');

async function main() {
  // Open a Chromium browser. We use headless: false
  // to be able to watch what's going on.
  const browser = await playwright.chromium.launch({
    headless: false
  });
  // Open a new page / tab in the browser.
  const page = await browser.newPage();
  // Tell the tab to navigate to the book's page.
  const results = await page.goto('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html');
  // Pause for 10 seconds, to see what's going on.
  await page.waitForTimeout(10000);
  // Turn off the browser to clean up after ourselves.
  await browser.close();
}

main();

If you saw a Chromium window open and the book page load successfully, congratulations! You just robotized your web browser with Playwright!


The results variable holds the HTTP response returned by page.goto. To get the page HTML itself, call await page.content(); you can then feed that string to Cheerio to pull out the information you need.

Clicking buttons is extremely easy with Playwright. By prefixing text= to a string you’re looking for, Playwright will find the element that contains that string and click it. It will also wait for the element to appear if it’s not rendered on the page yet. This is a huge advantage over Puppeteer. Once you have clicked, you have to wait for the page to load and then use Cheerio to get the information you are looking for. But for now, we are not going in that direction.

Pros

  1. Clicking buttons is way easier than Puppeteer.
  2. Cross-browser support.
  3. Documentation is great

Cons

  1. Its Firefox and WebKit builds are patched versions rather than the stock rendering engines users actually run.

HTTP Clients

An HTTP client can be used to send requests to a server and retrieve the responses. We will discuss three libraries that are simply used to make an HTTP request to the server or the web page you are trying to scrape.

Axios

It is a promise-based HTTP client for both browser and node.js. It will provide us with the complete HTML code of the target website. Making a request using Axios is quite simple and straightforward.

const axios = require('axios');

async function main() {
  try {
    const scraped_data = await axios.get("https://books.toscrape.com/");
    console.log(scraped_data.data);
    //<!DOCTYPE html>......//
  } catch (err) {
    console.log(err);
  }
}

main();

You can install Axios through the following command

npm i axios --save

Pros

  1. It has interceptors to modify the request.
  2. Supports promise.
  3. Error handling is great.
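The first pro deserves a quick illustration. An Axios request interceptor is just a function that receives the outgoing request config and returns a modified copy; the transformation itself is plain JavaScript. A hedged sketch (the header values are illustrative, and the Axios registration is shown only as a comment):

```javascript
// The kind of transform a request interceptor applies: take the
// outgoing request config, return a copy with extra headers.
function addScrapingHeaders(config) {
  return {
    ...config,
    headers: {
      ...config.headers,
      'User-Agent': 'my-scraper/1.0',   // illustrative value
      'Accept-Language': 'en-US,en',
    },
  };
}

// With Axios you would register it roughly like this (not run here):
// axios.interceptors.request.use(addScrapingHeaders);

const out = addScrapingHeaders({ url: 'https://books.toscrape.com/', headers: {} });
console.log(out.headers['User-Agent']);
```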

Unirest

Unirest is a set of lightweight HTTP libraries available in multiple languages, built and maintained by Kong, the company behind the open-source Kong API gateway. Using Unirest is similar to using Axios, and you can use it as an alternative.

const unirest = require('unirest');

async function main() {
  try {
    const scraped_data = await unirest.get("https://books.toscrape.com/");
    console.log(scraped_data.body);
    //<!DOCTYPE html>......//
  } catch (err) {
    console.log(err);
  }
}

main();

You can install Unirest through the following command

npm i unirest --save

Pros

  1. Auto support for gzip.
  2. File transfer is simple.
  3. You can send the request directly by providing a callback along with the URL.

Superagent

Superagent is a small, progressive, client-side HTTP request library, also available as a Node.js module with the same API, supporting many high-level HTTP client features. It has an API similar to Axios and supports promises and async/await syntax.

const superagent = require('superagent');

async function main() {
  try {
    const scraped_data = await superagent.get("https://books.toscrape.com/");
    console.log(scraped_data.text);
    //<!DOCTYPE html>......//
  } catch (err) {
    console.log(err);
  }
}

main();

You can install Superagent through the following command

npm i superagent --save

Pros

  • Multiple functions chaining to send requests.
  • Numerous plugins available for many common features
  • Works in both Browser and Node

Cons

  1. Its API does not adhere to any standard.

Nightmare

Nightmare is a high-level browser automation library from Segment. It uses Electron (the same Chromium-based framework that powers the Atom text editor), which is similar to PhantomJS but about twice as fast and a bit more modern. It was originally designed for automating tasks across sites that don’t have APIs but is most often used for UI testing and crawling.

Nightmare is an ideal choice over Puppeteer if you don’t like the heavy bundle the latter comes with. Because it drives a real browser, your scraper bypasses a lot of the annoying code that can trip it up. It also means websites that render mostly on the client side are still scrapeable; if you’ve ever been thrown by needing to reverse-engineer an AJAX request just to get at a form, today is your day to be awesome!

You can install the nightmare library by running the following command:

npm install nightmare

Once Nightmare is installed we will find Scrapingdog’s website link through the Duckduckgo search engine.

const Nightmare = require('nightmare');
const nightmare = Nightmare();

nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'Scrapingdog')
  .click('#search_button_homepage')
  .wait('#links .result__a')
  .evaluate(() => document.querySelector('#links .result__a').href)
  .end()
  .then((link) => {
    console.log('Scrapingdog Web Link:', link)
    //Scrapingdog Web Link: https://www.scrapingdog.com/
  })
  .catch((error) => {
    console.error('Search failed:', error)
  })

Now, we’ll go line by line. First, we create an instance of Nightmare. Then we open the DuckDuckGo search engine using .goto. Next we target the search bar by its selector and set its value to "Scrapingdog" using .type. Once that is done, we submit the search. Nightmare waits until the first result link has loaded, and then uses the DOM to read the value of its href attribute. After receiving the link, it prints it to the console.

Pros

  1. It’s a great use case for ES7 await keyword.
  2. Wait feature to pause the browser.
  3. Lighter than Puppeteer.

Cons

  1. Undiscovered vulnerabilities may exist in Electron that could allow a malicious website to execute code on your computer.

Summary

So, these were some open-source web scraping tools and libraries that you can use for your web scraping projects. If you just want to focus on data collection, you can always use a web scraping API.

Feel free to comment and ask me anything. You can follow me on Twitter and  Medium. Thanks for reading and please hit the like button! 👍


Manthan Koolwal

My name is Manthan Koolwal and I am the CEO of scrapingdog.com. I love creating scrapers and seamless data pipelines.

Try Scrapingdog for Free!

1,000 free API calls for testing.

No credit card required!
