
Web Scraping 101 with Java

16-02-2021

There are a couple of ways to perform web scraping using Java. In this article, we shall go through some of the most commonly used methods to scrape data from websites.


Several libraries are available for web scraping in Java. They include:

1. HTMLUnit

2. JSOUP

3. WebMagic

We shall go through the three tools mentioned above to understand the details crucial for scraping websites using Java.

Site to Scrape

We are going to scrape the Scrapingdog blog page. We shall obtain the header text from the blog list. After fetching this list, we shall proceed to output it in our program.


Before you can scrape a website, you must understand the structure of the underlying HTML. Understanding this structure gives you an idea of how to traverse the HTML tags as you implement your scraper. In your browser, right-click on any element in the blog list and select "Inspect Element" from the menu that appears. Alternatively, you can press Ctrl + Shift + I to inspect a web page. The image below shows the list elements as they occur repeatedly on this page.

[Image: the repeated blog list elements in the browser inspector]

The image below shows the structure of a single blog list div element. Our point of interest is the <h2> tag that contains the blog title. To access the h2 tag, we will use the CSS query "div.blog-header a h2".

[Image: structure of a single blog list div element]

This tutorial assumes you have basic knowledge of Java and of dependency management in Maven/Gradle.

Web Scraping with Java Using HTMLUnit

Dependencies

HtmlUnit is a GUI-less Java library for accessing websites. It is an open-source framework with 21+ contributors actively participating in its development.

To use HtmlUnit, you can download it from SourceForge or add it as a dependency in your pom.xml. Add the following dependency to your Maven-based project.

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.48.0-SNAPSHOT</version>
</dependency>

Since this is a snapshot version, you also have to tell Maven where to find it by adding the Sonatype snapshot repository to the repositories section of your pom.xml.

<repositories>
    <repository>
        <id>sonatype-nexus-snapshots</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    </repository>
</repositories>

Procedure

The base URL that we shall be scraping is https://www.scrapingdog.com/blog/.

1. First, we are going to define the web client that we are going to use. HtmlUnit enables you to simulate a web client of your choice: Chrome, Firefox, etc. In this case, we shall choose Chrome.

//create a chrome web client
WebClient chromeWebClient = new WebClient(BrowserVersion.CHROME);

2. Next, we shall set up configurations for the web client. Defining some of the configurations optimizes the speed of scraping.

This line makes it possible for the web client to use insecure SSL:

chromeWebClient.getOptions().setUseInsecureSSL(true);

Next, we stop the client from throwing exceptions on failing HTTP status codes and JavaScript errors that may arise while scraping the site.

chromeWebClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
chromeWebClient.getOptions().setThrowExceptionOnScriptError(false);

Moreover, we disable CSS. This optimizes the scraping process.

chromeWebClient.getOptions().setCssEnabled(false);

3. After configuring the web client, we are now going to fetch the HTML page. In our case, we are going to fetch https://www.scrapingdog.com/blog/.

//fetch the blog page
HtmlPage htmlPage = chromeWebClient.getPage("https://www.scrapingdog.com/blog/");

4. Fetch the DOM elements of interest using CSS queries. When selecting elements in CSS, we use selectors. Selector references are used to access DOM elements on the page for styling. As we had previously concluded, the selector reference that will give us access to the blog titles in the list is "div.blog-header a h2".

Using HtmlUnit, we shall select all the elements and store them in a DomNodeList.

//fetch the given elements using CSS query selector
DomNodeList<DomNode> blogHeadings = htmlPage.querySelectorAll("div.blog-header a h2");

5. Since the individual elements are contained in DomNode objects, we shall iterate through the DomNodeList, printing out the content of each node.

//loop through the headings printing out the content
for (DomNode domNode : blogHeadings) {
    System.out.println(domNode.asText());
}
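
For reference, here is a minimal sketch that puts the steps above together into a single runnable class. The class name HtmlUnitScraper is our own choice, and the selector assumes the blog structure discussed earlier.

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.DomNodeList;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitScraper {

    public static void main(String[] args) throws Exception {
        //create a chrome web client
        WebClient chromeWebClient = new WebClient(BrowserVersion.CHROME);

        //configure the client for faster, failure-tolerant scraping
        chromeWebClient.getOptions().setUseInsecureSSL(true);
        chromeWebClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        chromeWebClient.getOptions().setThrowExceptionOnScriptError(false);
        chromeWebClient.getOptions().setCssEnabled(false);

        //fetch the blog page
        HtmlPage htmlPage = chromeWebClient.getPage("https://www.scrapingdog.com/blog/");

        //fetch the blog headings using a CSS query selector
        DomNodeList<DomNode> blogHeadings = htmlPage.querySelectorAll("div.blog-header a h2");

        //loop through the headings printing out the content
        for (DomNode domNode : blogHeadings) {
            System.out.println(domNode.asText());
        }

        //release the client resources
        chromeWebClient.close();
    }
}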

Web Scraping with Java Using JSOUP

JSOUP is an open-source Java parser for working with HTML. It provides an extensive set of APIs for fetching and manipulating the fetched data using DOM methods and CSS query selectors. JSOUP has an active community of 88+ contributors on GitHub.

Dependencies

To use Jsoup, you will have to add its dependency in your pom.xml file.

<dependency>
    <!-- jsoup HTML parser library @ https://jsoup.org/ -->
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

Procedure

1. Firstly, we will fetch the web page of choice and store it as a Document data type.

//fetch the web page
Document page = Jsoup.connect("https://www.scrapingdog.com/blog/").get();

2. Select the individual page elements using a CSS query selector. We shall select these elements from the page (Document) that we had previously defined.

//selecting the blog headers from the page using CSS query
Elements pageElements = page.select("div.blog-header a h2");

3. Declare an array list to store the blog headings.

//ArrayList to store the blog headings
ArrayList<String> blogHeadings = new ArrayList<>();

4. Create an enhanced for loop to iterate through the fetched elements, “pageElements”, storing them in the array list.

//loop through the fetched page elements adding them to the blogHeadings array list
for (Element e : pageElements) {
    blogHeadings.add("Heading: " + e.text());
}

5. Finally, print the contents of the array list.

//print out the array list
for (String s : blogHeadings) {
    System.out.println(s);
}
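
As with HtmlUnit, here is a minimal sketch that combines the Jsoup steps above into a single runnable class. The class name JsoupScraper is our own choice, and the selector assumes the same blog structure as before.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;

public class JsoupScraper {

    public static void main(String[] args) throws Exception {
        //fetch the web page
        Document page = Jsoup.connect("https://www.scrapingdog.com/blog/").get();

        //select the blog headers from the page using a CSS query
        Elements pageElements = page.select("div.blog-header a h2");

        //ArrayList to store the blog headings
        ArrayList<String> blogHeadings = new ArrayList<>();

        //loop through the fetched page elements adding them to the list
        for (Element e : pageElements) {
            blogHeadings.add("Heading: " + e.text());
        }

        //print out the array list
        for (String s : blogHeadings) {
            System.out.println(s);
        }
    }
}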

Web scraping with Java using Webmagic

Webmagic is an open-source, scalable crawler framework developed by CodeCraft, with developer support from 40+ contributors. Its architecture is modeled on Scrapy, the Python scraping framework, and several of its features are based on the Jsoup library.

Dependencies

To use the library, add the following dependencies to your pom.xml file.

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.4</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.4</version>
</dependency>

In case you have customized your own slf4j implementation, you need to add the following exclusion to the webmagic dependencies in your pom.xml.

<exclusions>
    <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
</exclusions>

Procedure

1. Unlike the previously mentioned implementations, here our class has to implement a Webmagic-defined interface, PageProcessor, which handles the processing of the fetched page.

//implement PageProcessor
public class WebMagicCrawler implements PageProcessor {

The page processor class implements the following methods:

@Override
public void process(Page page) {
}

@Override
public Site getSite() {
}

The process() method handles the various page-related operations whereas the getSite() method returns the site.

2. Define a class variable to hold the Site object. Here, you can configure things such as the number of times to retry a request and the sleep time before the next retry.

private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

However, in our case, we do not need to define all that; we shall use the default configuration.

private Site site = Site.me();

3. After declaring the Site variable, in the overridden getSite() method, add the following piece of code. This makes the method return the previously defined class variable, site.

@Override
public Site getSite() {
    return site;
}

4. In the process() method, we shall fetch the elements of interest and store them in a List.

//fetch all blog headings storing them in a list
List<String> rs = page.getHtml().css("div.blog-header a h2").all();

5. As in the previous library implementations, we shall print out the contents from our web scraping process by iterating through the string list.

//loop through the list printing out its contents
for (String s : rs) {
    System.out.println("Heading " + s);
}

6. Create a main method, then add the following code.

//define the url to scrape
//will run in a separate thread
Spider.create(new WebMagicCrawler())
        .addUrl("https://www.scrapingdog.com/blog/")
        .thread(5)
        .run();

In the above code, we create an instance of our class, define the URL to scrape, and run the spider with five threads.
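
Putting the pieces together, here is a minimal sketch of the complete crawler class, assembled from the snippets above. The class name WebMagicCrawler follows the earlier steps, and the selector assumes the blog structure discussed at the beginning.

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

import java.util.List;

//implement PageProcessor
public class WebMagicCrawler implements PageProcessor {

    //use the default site configuration
    private Site site = Site.me();

    @Override
    public void process(Page page) {
        //fetch all blog headings storing them in a list
        List<String> rs = page.getHtml().css("div.blog-header a h2").all();

        //loop through the list printing out its contents
        for (String s : rs) {
            System.out.println("Heading " + s);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        //define the url to scrape and run the spider with five threads
        Spider.create(new WebMagicCrawler())
                .addUrl("https://www.scrapingdog.com/blog/")
                .thread(5)
                .run();
    }
}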


Conclusion

In this tutorial, we guided you through developing a basic web scraper in Java. To avoid reinventing the wheel, there are several scraping libraries that you can use or customize to build your own web scraper. We built our scrapers with three of the top Java web scraping libraries; follow the links to learn more about them: HtmlUnit, Webmagic, and JSoup.

All of these libraries are feature-rich and boast sizeable, active community support. Moreover, they are all open source, and Webmagic in particular is highly scalable. If you would like to access the source code for this tutorial, you can follow this link to GitHub.

Additional Resources

And there’s the list! At this point, you should feel comfortable writing your first web scraper to gather data from any website using Java. Here are a few additional resources that you may find helpful during your web scraping journey:
