Scrapingdog

Top 5 JavaScript Web Scraping Libraries

02-06-2020

Web scraping is a great way to collect large amounts of data quickly. As the amount of data worldwide keeps growing, web scraping has become more important for businesses than ever before. In this article, we are going to use JavaScript web scraping libraries and frameworks to scrape web pages. We are going to scrape “Books to Scrape” for demo purposes.

List of JavaScript Web Scraping Libraries

  1. request-promise-native
  2. Unirest
  3. Cheerio
  4. Puppeteer
  5. Osmosis

1. Request-Promise-Native

request-promise-native is an HTTP client through which you can easily make HTTP calls. It also supports HTTPS and follows redirects by default. (Note that the underlying request library has since been deprecated, though it still works.) Now, let’s see an example of request-promise-native and how it works.

const request = require('request-promise-native');

let scrape = async () => {
  var respo = await request('http://books.toscrape.com/');
  return respo;
};

scrape().then((value) => {
  console.log(value); // HTML code of the website
});

Advantages of using request-promise-native:

  1. Proxy support
  2. Custom headers
  3. HTTP authentication
  4. TLS/SSL support
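Those features map onto request-promise-native's options object roughly as sketched below. The proxy URL, credentials, and User-Agent are placeholders, not working values:

```javascript
// Sketch: the options object request-promise-native accepts.
// All values below are placeholder assumptions.
const options = {
  uri: 'http://books.toscrape.com/',
  proxy: 'http://user:pass@proxy.example.com:8080', // proxy support
  headers: { 'User-Agent': 'my-scraper/1.0' },      // custom headers
  auth: { user: 'username', pass: 'password' },     // HTTP authentication
  strictSSL: true                                   // verify TLS/SSL certificates
};

// With the package installed, you would pass the whole object in one call:
// const request = require('request-promise-native');
// request(options).then(html => console.log(html));
```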

2. Unirest

Unirest is a lightweight HTTP client library from Mashape. Along with JavaScript, it is also available for Java, .NET, Python, Ruby, etc.

  1. GET request

var unirest = require('unirest');

let scrape = async () => {
  var respo = await unirest.get('http://books.toscrape.com/');
  return respo.body;
}

scrape().then((value) => {
  console.log(value); // Success!
});

2. POST request

var unirest = require('unirest');

let scrape = async () => {
  var respo = await unirest.post('https://httpbin.org/anything').headers({'X-header': '123'});
  return respo.body;
}

scrape().then((value) => {
  console.log(value); // Success!
});

Response

{
  args: {},
  data: '',
  files: {},
  form: {},
  headers: {
    'Content-Length': '0',
    Host: 'httpbin.org',
    'X-Amzn-Trace-Id': 'Root=1-5ed62f2e-554cdc40bbc0b226c749b072',
    'X-Header': '123'
  },
  json: null,
  method: 'POST',
  origin: '23.238.134.113',
  url: 'https://httpbin.org/anything'
}

3. PUT request

var unirest = require('unirest');

let scrape = async () => {
  var respo = await unirest.put('https://httpbin.org/anything').headers({'X-header': '123'});
  return respo.body;
}

scrape().then((value) => {
  console.log(value); // Success!
});

Response

{
  args: {},
  data: '',
  files: {},
  form: {},
  headers: {
    'Content-Length': '0',
    Host: 'httpbin.org',
    'X-Amzn-Trace-Id': 'Root=1-5ed62f91-bb2b684e39bbfbb3f36d4b6e',
    'X-Header': '123'
  },
  json: null,
  method: 'PUT',
  origin: '23.63.69.65',
  url: 'https://httpbin.org/anything'
}

In the responses to the POST and PUT requests, you can see that the custom header we added is echoed back under headers. Custom headers let you control how the server treats your request, for example by identifying your client or passing an API key.
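Because httpbin echoes the request back, you can check programmatically that your custom header actually went out. sentHeader below is a hypothetical helper, not part of unirest, and the body object is a trimmed copy of the POST response shown above:

```javascript
// Hypothetical helper: look up a header in an httpbin-style response body.
function sentHeader(body, name) {
  return (body.headers || {})[name];
}

// A trimmed copy of the POST response shown above:
const body = {
  headers: { Host: 'httpbin.org', 'X-Header': '123' },
  method: 'POST',
  url: 'https://httpbin.org/anything'
};

console.log(sentHeader(body, 'X-Header')); // → '123'
```

In a real script you would call this on respo.body after the request resolves.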

Advantages of using Unirest

  1. Supports all HTTP methods (GET, POST, DELETE, etc.)
  2. Supports form uploads
  3. Supports both streaming and callback interfaces
  4. HTTP authentication
  5. Proxy support
  6. TLS/SSL support

3. Cheerio

With the Cheerio module, you can use jQuery’s syntax while working with downloaded web data. Cheerio lets developers focus on the downloaded data rather than on parsing it. Now, we’ll calculate the number of books available on the first page of the target website.

const request = require('request-promise-native');
const cheerio = require('cheerio');

let scrape = async () => {
  var respo = await request('http://books.toscrape.com/');
  return respo;
};

scrape().then((value) => {
  const $ = cheerio.load(value);
  var numberofbooks = $('ol[class="row"]').find('li').length;
  console.log(numberofbooks); // 20!
});

We are finding all the li tags inside the ol tag with the class row, one li per book.


Advantages of using Cheerio

  • Familiar syntax: Cheerio implements a subset of core jQuery. It removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.
  • Lightning quick: Cheerio works with a very simple, consistent DOM model. As a result, parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that Cheerio is about 8x faster than JSDOM.
  • Stunningly flexible: Cheerio can parse nearly any HTML or XML document.

4. Puppeteer

  • Puppeteer is a Node.js library that offers a simple but efficient API that enables you to control Google’s Chrome or Chromium browser.
  • It also enables you to run Chromium in headless mode (useful for running browsers in servers) and send and receive requests without needing a user interface.
  • It offers tighter control over Chrome because it does not rely on any external adapter, and it is maintained by Google.
  • The great thing is that it works in the background, performing actions as instructed by the API.

We’ll see an example of puppeteer scraping the complete HTML code of our target website.

const puppeteer = require('puppeteer');

let scrape = async () => {
  const browser = await puppeteer.launch({headless: true});
  const page = await browser.newPage();

  await page.goto('http://books.toscrape.com/');

  await page.waitFor(1000);

  var result = await page.content();

  await browser.close();
  return result;
};

scrape().then((value) => {
  console.log(value); // complete HTML code of the target url!
});

What each step means here:

  1. We launch a Chrome browser.
  2. We open a new tab.
  3. We navigate to the target URL.
  4. We wait for 1 second to let the page load completely.
  5. We extract the complete HTML content of the page.
  6. We close the Chrome browser.
  7. We return the result.

Advantages of using Puppeteer

  • Click elements such as buttons, links, and images
  • Automate form submissions
  • Navigate pages
  • Take a timeline trace to find out where the issues are on a website
  • Carry out automated testing for user interfaces and various front-end apps directly in a browser
  • Take screenshots
  • Convert web pages to pdf files
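The screenshot and PDF features from that list boil down to two calls on the page object. Here is a sketch; the Puppeteer calls are commented out because they need a Chromium download to run, and the file paths and options are placeholder assumptions:

```javascript
// Placeholder options for the calls sketched below:
const screenshotOptions = { path: 'books.png', fullPage: true };
const pdfOptions = { path: 'books.pdf', format: 'A4' };

// const puppeteer = require('puppeteer');
//
// (async () => {
//   const browser = await puppeteer.launch({ headless: true });
//   const page = await browser.newPage();
//   await page.goto('http://books.toscrape.com/');
//   await page.screenshot(screenshotOptions); // capture the full page as PNG
//   await page.pdf(pdfOptions);               // render the page to PDF
//   await browser.close();
// })();
```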

I have covered Puppeteer in much more detail in a separate article on this blog; please go through the complete article.

5. Osmosis

  • Osmosis is an HTML/XML parser and web scraper.
  • It is written for Node.js and ships with CSS3/XPath selectors and a lightweight HTTP wrapper.
  • It has no large dependencies such as Cheerio or jsdom.

We’ll do a simple single-page scrape. We’ll be working with this page on Wikipedia, which contains population information for the US States.

const osmosis = require('osmosis');

osmosis
  .get('https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population')
  .set({ heading: 'h1', title: 'title' })
  .data(item => console.log(item));

The response will look like this

{ heading: 'List of U.S. states and territories by population', title: 'List of U.S. states and territories by population - Wikipedia' }

Advantages of using Osmosis

  • Supports CSS 3.0 and XPath 1.0 selector hybrids
  • Load and search AJAX content
  • Logs URLs, redirects, and errors
  • Cookie jar and custom cookies/headers/user agent
  • Login/form submission, session cookies, and basic auth
  • Single proxy or multiple proxies and handles proxy failure
  • Retries and redirect limits
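Osmosis can also chain .find() with .set() to pull several fields per table row instead of one heading per page. Below is a sketch against the same Wikipedia page; the osmosis calls are commented out because they need the package and a network connection, and the td positions are assumptions about the table layout that may need adjusting:

```javascript
// Assumed selector map: one field per table cell (hypothetical positions).
const selectors = {
  state: 'td:nth-child(3) a',    // state name cell
  population: 'td:nth-child(4)'  // population estimate cell
};

// const osmosis = require('osmosis');
// osmosis
//   .get('https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population')
//   .find('table.wikitable tr')
//   .set(selectors)
//   .data(item => console.log(item)); // one { state, population } per row
```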

Choosing the Best JavaScript Library for Web Scraping

There are a few things to consider before choosing the best JavaScript library for web scraping:

  1. It should be easy to use and well documented.
  2. It should be able to handle large amounts of data.
  3. It should be able to handle different types of data (e.g., text, images, etc.).
  4. It should be able to handle different types of web pages (e.g., static, dynamic, etc.).

Conclusion

We have seen how to scrape data with Node.js using request-promise-native, Unirest, Cheerio, Puppeteer, and Osmosis, regardless of the type of website. Web scraping is set to grow as time progresses, and as web scraping applications abound, JavaScript libraries will grow in demand. With so many capable JavaScript libraries available, it can be puzzling to choose the right one, but it eventually boils down to your own requirements.

Feel free to comment and ask me anything. You can follow me on Twitter and Medium. Thanks for reading, and please hit the like button! 👍

Frequently Asked Questions

Q: Is JavaScript better than Python for web scraping?

Ans: JavaScript and Python have their own advantages and disadvantages when it comes to web scraping. However, some general points to consider include the following:

– JavaScript can be more challenging to learn than Python for some people.

– Python is often faster to write code in than JavaScript.

– Python code is typically more concise and easier to read than JavaScript code.

– JavaScript can be more difficult to debug than Python.

Additional Resources

And there’s the list! At this point, you should feel comfortable writing your first web scraper to gather data from any website. Here are a few additional resources that you may find helpful during your web scraping journey:

Manthan Koolwal

My name is Manthan Koolwal, and I love to create web scrapers. I have been building them for the last 10 years and have created many seamless data pipelines for multiple MNCs. Right now I am working on Scrapingdog, a web scraping API that can scrape any website without blockage at any scale. Feel free to contact me for any web scraping query. Happy Scraping!