Web scraping is a fast way to collect large amounts of data. As the volume of data worldwide keeps growing, web scraping has become more important for businesses than ever before.
In this article, we are going to use JavaScript web scraping libraries and frameworks to scrape web pages. For demo purposes, we are going to scrape http://books.toscrape.com/.
In the responses to the POST and PUT requests, you can see the custom header I added. We add custom headers to customize the response we get back.
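The custom header mentioned above can be sketched with Unirest roughly like this. This is only a minimal sketch: the header name X-Custom-Header, its value, and the helper name postWithCustomHeader are made up for illustration, and it assumes you have run npm install unirest.

```javascript
// Hypothetical header values, chosen only for illustration.
const CUSTOM_HEADERS = {
  'Accept': 'application/json',
  'Content-Type': 'application/json',
  'X-Custom-Header': 'hello-world', // assumed name, not from the original example
};

// Sends `body` as a POST request to `url` with the custom headers attached.
// Resolves with the Unirest response object (check response.error on failure).
function postWithCustomHeader(url, body, headers = CUSTOM_HEADERS) {
  // Loaded inside the function so the file still loads before unirest is installed.
  const unirest = require('unirest');
  return new Promise((resolve) => {
    unirest
      .post(url)
      .headers(headers)
      .send(body)
      .end(resolve);
  });
}
```

You could call it as postWithCustomHeader('https://httpbin.org/post', { title: 'demo' }).then((res) => console.log(res.status, res.body)). The httpbin.org endpoint is an assumption here; any service that echoes request headers would work. A PUT request would follow the same pattern with unirest.put.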
Advantages of using Unirest
Supports all HTTP methods (GET, POST, DELETE, etc.)
Supports form uploads
Supports both streaming and callback interfaces
Supports HTTP authentication
Supports proxies
Supports the TLS/SSL protocols
Cheerio
With the Cheerio module, you can use jQuery syntax while working with downloaded web data. Cheerio lets developers focus on the downloaded data rather than on parsing it.
Now, we'll count the books listed on the first page of the target website.
We find all the li tags inside the ol tag with the class row.
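The selector described above can be sketched as follows. To keep the example self-contained, it runs against a tiny made-up HTML snippet rather than the live page; SAMPLE_HTML and the function name countBooks are assumptions for illustration, and the sketch assumes npm install cheerio.

```javascript
// Made-up stand-in for the real page: two books inside <ol class="row">.
const SAMPLE_HTML = `
<ol class="row">
  <li><article class="product_pod"><h3><a title="Book One">Book One</a></h3></article></li>
  <li><article class="product_pod"><h3><a title="Book Two">Book Two</a></h3></article></li>
</ol>`;

// Counts every <li> nested inside an <ol> that has the class "row".
function countBooks(html) {
  // Loaded inside the function so the file still loads before cheerio is installed.
  const cheerio = require('cheerio');
  const $ = cheerio.load(html);
  return $('ol.row li').length;
}
```

Calling countBooks(SAMPLE_HTML) counts the two list items in the snippet. To run it against the real page, pass in the HTML you downloaded with whichever HTTP client you prefer.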
Advantages of using Cheerio
Familiar syntax: Cheerio implements a subset of core jQuery. It removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.
Lightning quick: Cheerio works with a very simple, consistent DOM model. As a result, parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that Cheerio is about 8x faster than JSDOM.
Stunningly flexible: Cheerio can parse nearly any HTML or XML document.
Puppeteer
Puppeteer is a Node.js library that offers a simple but efficient API that enables you to control Google’s Chrome or Chromium browser.
It also enables you to run Chromium in headless mode (useful for running browsers on servers), so it can send and receive requests without a user interface.
It gives you fine-grained control over Chrome because it does not rely on any external adapter, and it is backed by Google.
The great thing is that it works in the background, performing actions as instructed by the API.
Let's look at a Puppeteer example that scrapes the complete HTML code of our target website.
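const puppeteer = require('puppeteer');

const scrape = async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('http://books.toscrape.com/');
  // page.waitFor() is not available in recent Puppeteer releases; a plain timer gives the same 1-second pause.
  await new Promise((resolve) => setTimeout(resolve, 1000));
  const result = await page.content();
  await browser.close();
  return result;
};

scrape().then((value) => {
  console.log(value); // complete HTML code of the target URL
});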
What each step means here:
The first line inside scrape launches a Chrome browser.
The second line opens a new tab.
The third line navigates to the target URL.
We wait for 1 second to let the page finish loading.
We extract the complete HTML content of the page.
We close the Chrome browser.
We return the result.
Advantages of using Puppeteer
Click elements such as buttons, links, and images
Automate form submissions
Navigate pages
Take a timeline trace to find out where the issues are in a website
Carry out automated testing for user interfaces and various front-end apps, directly in a browser
Take screenshots
Convert web pages to PDF files
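As a rough illustration of the last two points, here is a minimal sketch that saves a screenshot and a PDF of a page. The function name capturePage and the default output file names are my own assumptions, and the sketch assumes npm install puppeteer.

```javascript
// Saves a full-page screenshot and a PDF of `url`.
// The default file names are arbitrary choices for this sketch.
async function capturePage(url, { screenshotPath = 'page.png', pdfPath = 'page.pdf' } = {}) {
  // Loaded inside the function so the file still loads before puppeteer is installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // Wait until network activity has mostly settled before capturing.
    await page.goto(url, { waitUntil: 'networkidle2' });
    await page.screenshot({ path: screenshotPath, fullPage: true });
    await page.pdf({ path: pdfPath, format: 'A4' });
    return { screenshotPath, pdfPath };
  } finally {
    // Close the browser even if one of the steps above fails.
    await browser.close();
  }
}
```

For example, capturePage('http://books.toscrape.com/').then(console.log) should leave page.png and page.pdf in the current directory. The try/finally is a design choice so a failed navigation does not leave a Chrome process running.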
I have explained pretty much everything about Puppeteer in a separate article; please go through it for the complete picture.
Osmosis
Scraping the heading and title of a Wikipedia page with Osmosis returns:
{ heading: 'List of U.S. states and territories by population', title: 'List of U.S. states and territories by population - Wikipedia' }
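An object of the shape shown above could come from a scraper along these lines. Treat it as a reconstruction, not the exact code behind that output: the Wikipedia URL is inferred from the page title, the h1 and title selectors and the function name scrapeHeadingAndTitle are assumptions, and the sketch assumes npm install osmosis.

```javascript
// URL inferred from the page title in the output; treat it as an assumption.
const WIKI_URL =
  'https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population';

// Resolves with an object like { heading, title } once Osmosis has finished.
function scrapeHeadingAndTitle(url = WIKI_URL) {
  // Loaded inside the function so the file still loads before osmosis is installed.
  const osmosis = require('osmosis');
  return new Promise((resolve) => {
    let result = null;
    osmosis
      .get(url)
      // Map output keys to selectors: the first <h1> and the <title> tag (assumed selectors).
      .set({ heading: 'h1', title: 'title' })
      .data((item) => { result = item; })
      .error(console.error)
      .done(() => resolve(result));
  });
}
```

Calling scrapeHeadingAndTitle().then(console.log) prints the scraped object, or null if nothing could be extracted.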
Advantages of using Osmosis
Supports CSS 3.0 and XPath 1.0 selector hybrids
Load and search AJAX content
Logs URLs, redirects, and errors
Cookie jar and custom cookies/headers/user agent
Login/form submission, session cookies, and basic auth
Single proxy or multiple proxies and handles proxy failure
Retries and redirect limits
Conclusion
In this article, we saw how to scrape data with Node.js using Request-promise-native, Unirest, Cheerio, Puppeteer, and Osmosis, regardless of the type of website.
Web scraping is set to grow as time progresses. As web scraping applications abound, JavaScript libraries will grow in demand.
With several capable JavaScript libraries available, choosing the right one can be puzzling. In the end, it boils down to your own requirements.
Feel free to comment and ask me anything. You can follow me on Twitter and Medium. Thanks for reading and please hit the like button! 👍
Additional Resources
And there’s the list! At this point, you should feel comfortable writing your first web scraper to gather data from any website. Here are a few additional resources that you may find helpful during your web scraping journey: