
Java Web Scraping 101: How to Get Started

16-02-2021

Java is one of the oldest and most popular programming languages. Over time it has evolved considerably and has become the go-to platform for many services and applications.

Web scraping is a process of extracting data from websites and storing it in a format that can be easily accessed and analyzed. It can be used to gather information about a product or service, track competitors, or even to monitor your own website for changes.

Web scraping can be done manually, but it is often more efficient to use a tool or script to automate the process.

There are a couple of ways to perform web scraping using Java. For the purposes of this guide, we will go through some of the most commonly used methods.

Web Scraping With Java

Java Web Scraping Basics

There are a few different libraries that can be used for web scraping in Java. The most popular ones are Jsoup and HtmlUnit.

In order to scrape a website, you first need to connect to it and retrieve the HTML source code. This can be done using the connect() method in the Jsoup library.

Once you have the HTML source code, you can use the select() method to query the DOM and extract the data you need.
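The connect-then-select pattern can be sketched in a few lines. This is a minimal illustration, assuming the jsoup dependency (shown later in this guide) is on your classpath:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ConnectAndSelect {
    public static void main(String[] args) throws Exception {
        // connect() builds the HTTP request; get() executes it and parses the HTML
        Document doc = Jsoup.connect("https://www.scrapingdog.com/blog/").get();

        // select() runs a CSS query against the parsed DOM
        Elements headings = doc.select("h2");
        for (Element h : headings) {
            System.out.println(h.text());
        }
    }
}
```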

Several libraries are available for Java web scraping. They include:

1. HTMLUnit

2. JSOUP

3. WebMagic

We shall go through the three tools mentioned above to understand the integral details crucial for scraping websites using Java.

Site to Scrape

We are going to scrape the Scrapingdog blog page. We shall obtain the header text from the blog list. After fetching this list, we shall proceed to output it in our program.

Web Scraping Scrapingdog’s Blog Page with Java

Before you can scrape a website, you must understand the structure of the underlying HTML. Understanding this structure gives you an idea of how to traverse the HTML tags as you implement your scraper.

In your browser, right-click any element on the blog list. From the menu that is displayed, select “Inspect Element.” Alternatively, you can press Ctrl + Shift + I to inspect a web page. The image below shows the list elements as they occur repeatedly on this page.

copying the HTML

The image below shows the structure of a single blog list div element. Our point of interest is the <h2> tag that contains the blog title. To access the h2 tag, we will use the CSS query “div.blog-header a h2”.

Extracting the HTML structure

This tutorial assumes you have basic knowledge of Java and dependency management in Maven/Gradle.

Web Scraping with Java Using HTMLUnit

Dependencies

HtmlUnit is a GUI-less Java library for accessing websites. It is an open-source framework with 21+ contributors actively participating in its development.

To use HtmlUnit, you can download it from Sourceforge, or add it as a dependency in your pom.xml.

Add the following dependency code in your maven-based project.

<dependency>
  <groupId>net.sourceforge.htmlunit</groupId>
  <artifactId>htmlunit</artifactId>
  <version>2.48.0-SNAPSHOT</version>
</dependency>

Because this is a snapshot version, you also have to tell Maven where to resolve it. Add the following snapshot repository to the <repositories> section of your pom.xml.

<repositories>
  <repository>
    <id>sonatype-nexus-snapshots</id>
    <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>

Procedure

The base URL that we shall be scraping is https://www.scrapingdog.com/blog/.

1. First, we are going to define the web client that we are going to use. HtmlUnit enables you to simulate a browser of choice: Chrome, Firefox, Safari, etc. In this case, we shall choose Chrome.

//create a chrome web client

WebClient chromeWebClient = new WebClient(BrowserVersion.CHROME);

2. Next, we shall set up configurations for the web client. Defining some of the configurations optimizes the speed of scraping.

This line makes it possible for the web client to use insecure SSL:

chromeWebClient.getOptions().setUseInsecureSSL(true);

Next, we disable JavaScript exceptions that may arise while scraping the site.

chromeWebClient.getOptions().setThrowExceptionOnFailingStatusCode(false);

chromeWebClient.getOptions().setThrowExceptionOnScriptError(false);

Moreover, we disable CSS processing, which speeds up scraping.

chromeWebClient.getOptions().setCssEnabled(false);

3. After configuring the web client, we are now going to fetch the HTML page. In our case, we are going to fetch https://www.scrapingdog.com/blog/.

//fetch the blog page

HtmlPage htmlPage = chromeWebClient.getPage("https://www.scrapingdog.com/blog/");

4. Fetch the DOM elements of interest using CSS queries. CSS selectors are normally used to target elements on a page for styling, but they can equally be used to access DOM elements programmatically.

As we had previously concluded, the selector reference that will give us access to the blog titles in the list is “div.blog-header a h2”.

Using HtmlUnit we shall select all the elements and store them in a DomNodeList.

//fetch the given elements using CSS query selector

DomNodeList<DomNode> blogHeadings = htmlPage.querySelectorAll("div.blog-header a h2");

5. Since the individual elements are contained in DomNode data structures, we shall iterate through the DomNodeList, printing out each node’s text.

//loop through the headings printing out the content

for (DomNode domNode: blogHeadings) {

System.out.println(domNode.asText());

}
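Putting steps 1 through 5 together, a complete runnable sketch might look like this (assuming the HtmlUnit dependency above is on the classpath; WebClient is AutoCloseable, so a try-with-resources block cleans it up):

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.DomNodeList;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitScraper {
    public static void main(String[] args) throws Exception {
        //create a chrome web client
        try (WebClient chromeWebClient = new WebClient(BrowserVersion.CHROME)) {
            //configure the client for faster, quieter scraping
            chromeWebClient.getOptions().setUseInsecureSSL(true);
            chromeWebClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            chromeWebClient.getOptions().setThrowExceptionOnScriptError(false);
            chromeWebClient.getOptions().setCssEnabled(false);

            //fetch the blog page
            HtmlPage htmlPage = chromeWebClient.getPage("https://www.scrapingdog.com/blog/");

            //fetch the heading elements using a CSS query selector
            DomNodeList<DomNode> blogHeadings = htmlPage.querySelectorAll("div.blog-header a h2");

            //loop through the headings printing out the content
            for (DomNode domNode : blogHeadings) {
                System.out.println(domNode.asText());
            }
        }
    }
}
```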
DOM node extracting

Web Scraping with Java Using JSOUP

JSOUP is an open-source Java HTML parser for working with HTML. It provides an extensive set of APIs for fetching pages and manipulating the fetched data using DOM methods and CSS query selectors. JSOUP has an active community of 88+ contributors on GitHub.

Dependencies

To use Jsoup, you will have to add its dependency in your pom.xml file.

<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.13.1</version>
</dependency>

Procedure

1. Firstly, we will fetch the web page of choice and store it as a Document data type.

//fetch the web page

Document page = Jsoup.connect("https://www.scrapingdog.com/blog/").get();

2. Select the individual page elements using a CSS query selector. We shall select these elements from the page (Document) that we had previously defined.

//selecting the blog headers from the page using CSS query

Elements pageElements = page.select("div.blog-header a h2");

3. Declare an array list to store the blog headings.

//ArrayList to store the blog headings

ArrayList<String> blogHeadings = new ArrayList<>();

4. Create an enhanced for loop to iterate through the fetched elements, “pageElements”, storing them in the array list.

//loop through the fetched page elements adding them to the blogHeadings array list

for (Element e : pageElements) {

blogHeadings.add("Heading: " + e.text());

}

5. Finally, print the contents of the array list.

//print out the array list

for (String s : blogHeadings) {

System.out.println(s);

}
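Assembled into a single class, the five steps above might look like this (a sketch assuming the jsoup dependency above; the extraction logic is factored into a helper so it can be reused on any parsed Document):

```java
import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupScraper {
    //select the blog headings from an already-parsed page and collect them
    static ArrayList<String> extractHeadings(Document page) {
        ArrayList<String> blogHeadings = new ArrayList<>();
        Elements pageElements = page.select("div.blog-header a h2");
        for (Element e : pageElements) {
            blogHeadings.add("Heading: " + e.text());
        }
        return blogHeadings;
    }

    public static void main(String[] args) throws Exception {
        //fetch the web page
        Document page = Jsoup.connect("https://www.scrapingdog.com/blog/").get();

        //print out the collected headings
        for (String s : extractHeadings(page)) {
            System.out.println(s);
        }
    }
}
```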
Printing the result

Web Scraping with Java Using WebMagic

WebMagic is an open-source, scalable crawler framework developed by Code Craft. The framework boasts developer support of 40+ contributors. The developers based this framework on the architecture of Scrapy, the Python scraping library, and modeled several of its features on the Jsoup library.

Dependencies

To use the library, add the following dependencies to your pom.xml file.

<dependency>
  <groupId>us.codecraft</groupId>
  <artifactId>webmagic-core</artifactId>
  <version>0.7.4</version>
</dependency>

<dependency>
  <groupId>us.codecraft</groupId>
  <artifactId>webmagic-extension</artifactId>
  <version>0.7.4</version>
</dependency>

If you have customized your Simple Logging Facade for Java (SLF4J) implementation, you need to add the following exclusions to your pom.xml.

<exclusions>
  <exclusion>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
  </exclusion>
</exclusions>

Procedure

1. Unlike the previous implementations, here our class has to implement a WebMagic-defined interface, PageProcessor, which handles the processing of the fetched page.

//implement PageProcessor

public class WebMagicCrawler implements PageProcessor {

A PageProcessor implementation must provide the following methods:

@Override

public void process(Page page) {

…

}

@Override

public Site getSite() {

…

}

The process() method handles the various page-related operations, whereas the getSite() method returns the site configuration.

2. Define a class variable to hold the Site configuration. Here you can set the number of retries and the sleep time before the next retry.

private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

However, in our case, we do not need to configure all that. We shall use the default configuration.

private Site site = Site.me();

3. After declaring the Site variable, in the overridden getSite() method, add the following piece of code. This makes the method return the previously defined class variable, site.

@Override

public Site getSite() {

return site;

}

4. In the process() method, we shall fetch the elements of interest and store them in a List.

//fetch all blog headings storing them in a list

List<String> rs = page.getHtml().css("div.blog-header a h2").all();

5. Like in the previous libraries implementations, we shall print out the contents from our web scraping process by iterating through the string list.

//loop through the list printing out its contents

for (String s : rs) {

System.out.println("Heading " + s);

}

6. Create a main method, then add the following code.

//define the url to scrape

//will run in a separate thread

Spider.create(new WebMagicCrawler()).addUrl("https://www.scrapingdog.com/blog/").thread(5).run();

In the above code, we create an instance of our class and define the URL to scrape. The spider then runs the crawl in separate threads.
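Assembled from the steps above, the full crawler class might look like this (a sketch assuming the webmagic-core and webmagic-extension dependencies listed earlier):

```java
import java.util.List;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

//implement PageProcessor
public class WebMagicCrawler implements PageProcessor {
    //default site configuration; retries and sleep time could be set here
    private Site site = Site.me();

    @Override
    public void process(Page page) {
        //fetch all blog headings storing them in a list
        List<String> rs = page.getHtml().css("div.blog-header a h2").all();

        //loop through the list printing out its contents
        for (String s : rs) {
            System.out.println("Heading " + s);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        //define the url to scrape; the spider runs the crawl in separate threads
        Spider.create(new WebMagicCrawler())
              .addUrl("https://www.scrapingdog.com/blog/")
              .thread(5)
              .run();
    }
}
```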

data crawling log
Data crawling log

Troubleshooting Web Scraping with Java

If you’re web scraping with Java, and you’re having trouble getting the data you want, there are a few things you can do to troubleshoot the issue.

First, check the code that you’re using to scrape the data. Make sure that it is correctly pulling the data from the website. If you’re not sure, you can use a network analysis tool like Fiddler or Wireshark to inspect the requests your scraper actually sends.

If the code is correct, but you’re still not getting the data you want, the website you’re scraping may be blocking your Java client. Many sites inspect request headers such as the User-Agent, and the defaults sent by Java HTTP clients are easy to detect. To check if this is the case, open the website in a regular browser, like Chrome or Firefox, and compare what you see with what your scraper receives. If the browser shows data your scraper can’t retrieve, the website is most likely blocking automated Java clients.

There are a few ways to get around this issue. One is to use a proxy server, which routes your requests through a different IP address so the website cannot easily tie them to your scraper.
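As a sketch, Jsoup can route requests through a proxy via its Connection API. The proxy host, port, and User-Agent string below are placeholders, not real values:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ProxyExample {
    public static void main(String[] args) throws Exception {
        Document page = Jsoup.connect("https://www.scrapingdog.com/blog/")
                //hypothetical proxy host and port -- replace with a proxy you control
                .proxy("127.0.0.1", 8080)
                //present a browser-like User-Agent instead of the Java default
                .userAgent("Mozilla/5.0")
                .get();
        System.out.println(page.title());
    }
}
```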

Another way to get around this issue is to scrape with a different language’s tooling, such as Python or Ruby, though bear in mind that blocking usually depends on how your requests look rather than on the language itself.

If you’re still having trouble, you can try reaching out to the website directly and asking them why they’re blocking Java. Sometimes, they may be willing to whitelist your IP address so that you can access the data.

No matter what, don’t give up! With a little troubleshooting, you should be able to get the data you need.

Conclusion

In this tutorial, we guided you through developing a basic web scraper in Java. To avoid reinventing the wheel, there are several scraping libraries that you can use or customize to build your own web scraper. In this tutorial, we developed the scrapers based on the three top Java web scraping libraries.

All of these libraries are feature-rich, boasting sizeable active community support, and they are all open source. WebMagic in particular is highly scalable. The source code for this tutorial is available on GitHub.

If you want to learn more about web scraping with Java, I recommend checking out the following resources:

– The Jsoup website: https://jsoup.org/

– The HtmlUnit website: http://htmlunit.sourceforge.net/

– A tutorial on web scraping with Java: https://www.tutorialspoint.com/data_scraping_with_java

Frequently Asked Questions

Python is a more versatile language and hence is often preferred for web scraping. Scraping a simple website with a simple HTTP request is very easy with Python.

Java and Python are both among the most popular programming languages. Java is faster, but Python is easier and simpler. Which one is better overall depends on how you are using them.

Additional Resources

And there’s the list! At this point, you should feel comfortable writing your first web scraper to gather data from any website using Java. Here are a few additional resources that you may find helpful during your web scraping journey:

Manthan Koolwal

My name is Manthan Koolwal and I am the CEO of scrapingdog.com. I love creating scrapers and seamless data pipelines.
