
Web Scraping 101 with Java

16-02-2021

There are several ways to perform web scraping in Java. In this tutorial, we shall go through some of the most commonly used methods of scraping data from websites.


Several libraries are available for web scraping in Java. They include:

1. HTMLUnit

2. JSOUP

3. WebMagic

We shall go through the three tools mentioned above to understand the details crucial for scraping websites using Java.

Site to Scrape

We are going to scrape the Scrapingdog blog page, obtaining the header text from each entry in the blog list. After fetching these headings, we shall print them out in our program.


Before you can scrape a website, you must understand the structure of its underlying HTML. Understanding this structure gives you an idea of how to traverse the HTML tags as you implement your scraper. In your browser, right-click on any element in the blog list and, from the menu that is displayed, select "Inspect Element." Alternatively, you can press Ctrl + Shift + I to inspect a web page. The screenshot below shows the list elements as they occur repetitively on this page.

[Screenshot: the repeated blog-list elements as seen in the browser inspector]

The image below shows the structure of a single blog-list div element. Our point of interest is the <h2> tag that contains the blog title. To access the h2 tag, we will use the CSS query "div.blog-header a h2".

[Screenshot: the HTML structure of a single blog-list div element]

This tutorial assumes you have basic knowledge of Java and of dependency management in Maven/Gradle.

Web Scraping with Java Using HTMLUnit

Dependencies

HtmlUnit is a GUI-less Java library for accessing websites. It is an open-source framework with 21+ contributors actively participating in its development.

To use HtmlUnit, you can download it from SourceForge or add it as a dependency in your pom.xml. Add the following dependency code to your Maven-based project.

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.48.0-SNAPSHOT</version>
</dependency>

Moreover, since this is a snapshot version, you have to tell Maven where to find it. Snapshot artifacts are resolved from the <repositories> section of your pom.xml (not from distributionManagement, which is only used when deploying your own artifacts), so add the Sonatype snapshot repository as follows:

<repositories>
    <repository>
        <id>sonatype-nexus-snapshots</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
    </repository>
</repositories>

Procedure

The base URL that we shall be scraping is https://www.scrapingdog.com/blog/.

1. First, we are going to define the web client that we shall use. HtmlUnit enables you to simulate a browser of choice: Chrome, Firefox, Safari, etc. In this case, we shall choose Chrome.

//create a chrome web client
WebClient chromeWebClient = new WebClient(BrowserVersion.CHROME);

2. Next, we shall set up configurations for the web client. Defining some of the configurations optimizes the speed of scraping.

This line makes it possible for the web client to use insecure SSL:

chromeWebClient.getOptions().setUseInsecureSSL(true);

Next, we disable the exceptions that failing HTTP status codes or JavaScript errors would otherwise raise while scraping the site.

chromeWebClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
chromeWebClient.getOptions().setThrowExceptionOnScriptError(false);

Moreover, we disable CSS processing, which further optimizes the scraping process.

chromeWebClient.getOptions().setCssEnabled(false);

3. After configuring the web client, we are now going to fetch the HTML page. In our case, we are going to fetch https://www.scrapingdog.com/blog/.

//fetch the blog page
HtmlPage htmlPage = chromeWebClient.getPage("https://www.scrapingdog.com/blog/");

4. Fetch the DOM elements of interest using CSS queries. CSS selectors are normally used to target elements on a page for styling; here, they give us a way to reference DOM elements programmatically. As we previously concluded, the selector that gives us access to the blog titles in the list is "div.blog-header a h2".

Using HtmlUnit, we shall select all the matching elements and store them in a DomNodeList.

//fetch the given elements using a CSS query selector
DomNodeList<DomNode> blogHeadings = htmlPage.querySelectorAll("div.blog-header a h2");

5. Since the individual elements are contained in DomNode objects, we shall iterate through the DomNodeList and print out the text of each node.

//loop through the headings printing out the content
for (DomNode domNode : blogHeadings) {
    System.out.println(domNode.asText());
}
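Putting the pieces together, here is a minimal, self-contained sketch of the whole HtmlUnit scraper. The class name HtmlUnitScraper is our own, and for brevity the main method simply declares the checked exceptions that getPage() can throw:

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.DomNodeList;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitScraper {
    public static void main(String[] args) throws Exception {
        //create a chrome web client; try-with-resources closes it when we are done
        try (WebClient chromeWebClient = new WebClient(BrowserVersion.CHROME)) {
            //apply the configuration from step 2
            chromeWebClient.getOptions().setUseInsecureSSL(true);
            chromeWebClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            chromeWebClient.getOptions().setThrowExceptionOnScriptError(false);
            chromeWebClient.getOptions().setCssEnabled(false);

            //fetch the blog page
            HtmlPage htmlPage = chromeWebClient.getPage("https://www.scrapingdog.com/blog/");

            //fetch the headings using the CSS query selector
            DomNodeList<DomNode> blogHeadings = htmlPage.querySelectorAll("div.blog-header a h2");

            //loop through the headings printing out the content
            for (DomNode domNode : blogHeadings) {
                System.out.println(domNode.asText());
            }
        }
    }
}

Running the class prints each blog title on its own line.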

Web Scraping with Java Using JSOUP

JSOUP is an open-source Java HTML parser. It provides an extensive set of APIs for fetching pages and manipulating the fetched data using DOM methods and CSS query selectors. JSOUP has an active community of 88+ contributors on GitHub.

Dependencies

To use Jsoup, you will have to add its dependency in your pom.xml file.

<dependency>
    <!-- jsoup HTML parser library @ https://jsoup.org/ -->
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

Procedure

1. Firstly, we will fetch the web page of choice and store it as a Document data type.

//fetch the web page
Document page = Jsoup.connect("https://www.scrapingdog.com/blog/").get();

2. Select the individual page elements using a CSS query selector. We shall select these elements from the page (Document) that we fetched in the previous step.

//selecting the blog headers from the page using a CSS query
Elements pageElements = page.select("div.blog-header a h2");

3. Declare an array list to store the blog headings.

//ArrayList to store the blog headings
ArrayList<String> blogHeadings = new ArrayList<>();

4. Create an enhanced for loop to iterate through the fetched elements, pageElements, storing their text in the array list.

//loop through the fetched page elements adding them to the blogHeadings array list
for (Element e : pageElements) {
    blogHeadings.add("Heading: " + e.text());
}

5. Finally, print the contents of the array list.

//print out the array list
for (String s : blogHeadings) {
    System.out.println(s);
}
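As before, here is a minimal, self-contained sketch of the complete Jsoup scraper. The class name JsoupScraper is our own; Jsoup.connect(...).get() throws IOException, which the main method declares:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;

public class JsoupScraper {
    public static void main(String[] args) throws IOException {
        //fetch the web page
        Document page = Jsoup.connect("https://www.scrapingdog.com/blog/").get();

        //selecting the blog headers from the page using a CSS query
        Elements pageElements = page.select("div.blog-header a h2");

        //ArrayList to store the blog headings
        ArrayList<String> blogHeadings = new ArrayList<>();
        for (Element e : pageElements) {
            blogHeadings.add("Heading: " + e.text());
        }

        //print out the array list
        for (String s : blogHeadings) {
            System.out.println(s);
        }
    }
}

Note how much shorter this is than the HtmlUnit version: Jsoup does not execute JavaScript, so there is no browser client to configure.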

Web Scraping with Java Using WebMagic

Webmagic is an open-source, scalable crawler framework developed by codecraft, with developer support from 40+ contributors. The developers based this framework on the architecture of Scrapy, the Python scraping framework, and modeled several of its features on the Jsoup library.

Dependencies

To use the library, add the following dependencies to your pom.xml file.

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.4</version>
</dependency>

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.4</version>
</dependency>

In case you have customized your slf4j implementation, you need to add the following exclusion to the webmagic dependencies in your pom.xml.

<exclusions>
    <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
</exclusions>

Procedure

1. Unlike in the other implementations, here our class has to implement a WebMagic-defined interface, PageProcessor, which handles the processing of fetched pages.

//implement PageProcessor
public class WebMagicCrawler implements PageProcessor {

The PageProcessor interface requires our class to implement the following methods:

@Override
public void process(Page page) {
    //...
}

@Override
public Site getSite() {
    //...
}

The process() method handles the various page-related operations, whereas the getSite() method returns the site configuration.

2. Define a class variable to hold the Site configuration. Here you can set the number of times to retry and the sleep time before the next retry.

private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

However, in our case, we do not need to define all that; we shall use the default configuration.

private Site site = Site.me();

3. After declaring the Site variable, add the following piece of code to the overridden getSite() method so that it returns the previously defined class variable, site.

@Override
public Site getSite() {
    return site;
}

4. In the process() method, we shall fetch the elements of interest and store them in a List.

//fetch all blog headings storing them in a list
List<String> rs = page.getHtml().css("div.blog-header a h2").all();

5. As in the previous implementations, we shall print out the scraped contents by iterating through the string list.

//loop through the list printing out its contents
for (String s : rs) {
    System.out.println("Heading " + s);
}

6. Create a main method, then add the following code.

//define the url to scrape
//will run in a separate thread
Spider.create(new WebMagicCrawler()).addUrl("https://www.scrapingdog.com/blog/").thread(5).run();

In the above code, we create an instance of our class, hand it to a Spider, and define the URL to scrape. The spider then runs the crawl on its own threads (five in this case).
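For completeness, here is the whole crawler class assembled from the steps above (WebMagicCrawler is the class name introduced in step 1):

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

import java.util.List;

//implement PageProcessor
public class WebMagicCrawler implements PageProcessor {

    private Site site = Site.me();

    @Override
    public void process(Page page) {
        //fetch all blog headings storing them in a list
        List<String> rs = page.getHtml().css("div.blog-header a h2").all();

        //loop through the list printing out its contents
        for (String s : rs) {
            System.out.println("Heading " + s);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        //define the url to scrape; the crawl runs on the spider's own threads
        Spider.create(new WebMagicCrawler()).addUrl("https://www.scrapingdog.com/blog/").thread(5).run();
    }
}

Unlike in the HtmlUnit and Jsoup versions, you do not fetch the page yourself: the Spider downloads each URL you add and hands the result to process().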


Conclusion

In this tutorial, we guided you through developing a basic web scraper in Java. To avoid reinventing the wheel, there are several scraping libraries that you can use or customize to build your own web scraper. We developed our scrapers with three top Java web scraping libraries: HtmlUnit, JSoup, and Webmagic.

All of these libraries are feature-rich and boast sizeable, active community support. Moreover, they are all open source. Webmagic, in particular, is extremely scalable. If you would like to access the source code for this tutorial, you can follow this link to GitHub.

Additional Resources

And there’s the list! At this point, you should feel comfortable writing your first web scraper to gather data from any website using Java. Here are a few additional resources that you may find helpful during your web scraping journey:

Manthan Koolwal

My name is Manthan Koolwal and I am the CEO of scrapingdog.com. I love creating scrapers and seamless data pipelines.

