
Web Scraping with Rust (A Beginner-Friendly Tutorial)

19-01-2023

In this article, we will learn web scraping with Rust. Rust isn't a common choice for web scraping, which makes it an interesting language to explore for the task.

This tutorial focuses on extracting data with Rust; afterwards, I will discuss the advantages and disadvantages of using the language.

Web Scraping with Rust

We will scrape http://books.toscrape.com/ using two popular Rust libraries, reqwest and scraper. We will talk about these libraries in a bit.

At the end of this tutorial, you will have a basic idea of how Rust works and how it can be used for web scraping.

What is Rust?

Rust is a systems programming language originally designed at Mozilla. Although it offers high-level abstractions, it works great when you need low-level control over memory, comparable to working with pointers in C and C++.

Concurrency is also a strong point of Rust. Multiple components of a program can run independently and at the same time, and the compiler makes sure they do so safely.

Error handling is also top-notch: many concurrency errors, such as data races, are caught at compile time instead of at run time, with a clear message describing the problem. This saves a lot of debugging time.
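As a small illustration of this compile-time safety (a minimal sketch, not from the original article): a closure passed to thread::spawn must take ownership of the data it uses, and forgetting the move keyword is a compile-time error rather than a run-time crash.

```rust
use std::thread;

// Sums a vector on a separate thread. Without `move`, the closure
// would only borrow `data`, and the compiler would reject the program
// at compile time instead of risking a dangling reference at run time.
fn sum_in_thread(data: Vec<i32>) -> i32 {
    let handle = thread::spawn(move || data.iter().sum::<i32>());
    handle.join().unwrap()
}

fn main() {
    println!("sum = {}", sum_in_thread(vec![1, 2, 3]));
}
```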

Rust is also used in game development and blockchain-based tools, and big companies such as AWS and Microsoft already use it in their infrastructure.

Setting up The Prerequisites to Web Scrape with Rust

I am assuming that you have already installed Rust and Cargo (the Rust package manager) on your machine; if not, you can refer to this guide for instructions on installing Rust. First, we have to create a Rust project.

cargo new rust_tutorial

Then we have to install two Rust libraries which will be used in the course of this tutorial.

  • reqwest: It will be used for making HTTP requests to the host website.
  • scraper: It will be used for parsing the HTML and selecting DOM elements.

Both of these libraries can be installed by adding them to your Cargo.toml file.

[dependencies]

reqwest = { version = "0.10.8", features = ["blocking"] }
scraper = "0.12.0"

These are the versions used at the time of writing; newer releases should also work with minor adjustments. The blocking feature of reqwest gives us a simple synchronous API, so this tutorial does not need any async code. Now you can access both libraries in your main project file src/main.rs.

What Are We Scraping Using Rust?

It is always better to decide in advance what you want to scrape. We will scrape the titles and prices of the individual books from this page.

Scraping titles and prices from this page

The process is straightforward. First, we will inspect the page in Chrome to identify the exact location of these elements in the DOM, and then we will use the scraper library to parse them out.

Scraping Individual Book Data

Let’s scrape book titles and prices in a step-by-step manner. First, you have to identify the DOM element location.

Identifying DOM element location

As you can see in the image above, the book title is stored inside the title attribute of the a tag. Now let's see where the price is stored.

Identifying price location in the page source code

The price is stored inside a p tag with the class price_color. Now, let's write the Rust code to extract this data.

The first step would be to import all the relevant libraries in the main file src/main.rs.

use reqwest::blocking::Client;
use scraper::{Html, Selector};

Using reqwest we will make an HTTP connection to the host website, and using scraper we will parse the HTML content that the GET request returns.

Now, we have to create a client which can be used for sending connection requests using reqwest.

let client = Client::new();

Then finally we are going to send the GET request to our target URL using the client we just created above.

let res = client.get("http://books.toscrape.com/")
    .send()
    .unwrap();

Note that we are using the blocking API of reqwest here, so send() waits for the response instead of returning a future.

Once the request is sent, you will get a response in HTML format, and you have to extract that HTML string from the res variable using .text().unwrap(). The .unwrap() call returns the value if the operation succeeded and stops the execution immediately (a panic) if there was an error.

let body = res.text().unwrap();

Here res.text().unwrap() will return an HTML string and we are storing that string in the body variable.
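To make the role of .unwrap() concrete, here is a small standalone sketch (not part of the original tutorial) showing the two ways a Result is usually consumed:

```rust
// `describe` consumes a Result with `match`, which lets you handle
// the error instead of panicking the way `unwrap` would.
fn describe(res: Result<i32, String>) -> String {
    match res {
        Ok(v) => format!("got {}", v),
        Err(e) => format!("request failed: {}", e),
    }
}

fn main() {
    // Ok value: unwrap simply returns the inner value.
    let ok: Result<i32, String> = Ok(42);
    assert_eq!(ok.unwrap(), 42);

    // Err value: unwrap would panic here, so inspect it with match.
    let err: Result<i32, String> = Err("connection refused".to_string());
    println!("{}", describe(err));
}
```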

Now we have a string from which we can extract all the data we want. Before we can use the scraper library, we have to convert this string into a scraper::Html object using Html::parse_document.

let document = Html::parse_document(&body);

Now, this object can be used for selecting elements and navigating to the desired element.

First, let’s create a selector for the book title. We are going to use the Selector::parse function to create a scraper::Selector object.

let book_title_selector = Selector::parse("h3 > a").unwrap();

Now this object can be used for selecting elements from the HTML document. We have passed h3 > a as the argument to the parse function. This is a CSS selector for the elements we are interested in: h3 > a selects all the a tags that are direct children of h3 tags.

As you can see in the image, the target a tag is a child of the h3 tag, which is why we used h3 > a in the code above.

Since there are many books, we will iterate over all of them using a for loop.

for book_title in document.select(&book_title_selector) {
    // The full title lives in the title attribute of the a tag;
    // the visible link text is truncated on this page.
    let title = book_title.value().attr("title").unwrap_or("");
    println!("Title: {}", title);
}

The select method returns an iterator over all the elements that match book_title_selector. We iterate over those matches, read the title attribute of each one (falling back to an empty string if it is missing), and print it.

The next and final step is to extract the price.

let book_price_selector = Selector::parse(".price_color").unwrap();

Again, we used the Selector::parse function to create a scraper::Selector object. As discussed above, the price is stored under the price_color class, so we pass .price_color as the CSS selector to the parse function.

Then we use another for loop, as we did above, to iterate over all the price elements.

for book_price in document.select(&book_price_selector) {
    let price = book_price.text().collect::<Vec<_>>();
    println!("Price: {}", price[0]);
}

For every element that matches the selector, we collect its text and print it to the console.

Finally, we have completed the code that extracts the title and the price of every book from the target URL. Save it and run it with cargo run, and you will get output that looks something like this (all the titles are printed first, followed by all the prices):

Title: A Light in the Attic
Title: Tipping the Velvet
Title: Soumission
Title: Sharp Objects
Title: Sapiens: A Brief History of Humankind
Title: The Requiem Red
Title: The Dirty Little Secrets of Getting Your Dream Job
Title: The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
Title: The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
Title: The Black Maria
Price: £51.77
Price: £53.74
Price: £50.10
Price: £47.82
Price: £54.23
Price: £22.65
Price: £33.34
Price: £17.93
Price: £22.60
Price: £52.15

Complete Code

You can extend the code to extract other information, such as star ratings and book images. Use the same technique: first inspect the page to find the element's location, then extract it with a Selector.

But for now, the code will look like this.

use reqwest::blocking::Client;
use scraper::{Html, Selector};

fn main() {
    // Create a new blocking HTTP client
    let client = Client::new();

    // Send a GET request to the website
    let res = client.get("http://books.toscrape.com/")
        .send()
        .unwrap();

    // Extract the HTML from the response
    let body = res.text().unwrap();

    // Parse the HTML into a document
    let document = Html::parse_document(&body);

    // Create a selector for the book titles
    let book_title_selector = Selector::parse("h3 > a").unwrap();

    // Iterate over the book titles; the full title is in the title attribute
    for book_title in document.select(&book_title_selector) {
        let title = book_title.value().attr("title").unwrap_or("");
        println!("Title: {}", title);
    }

    // Create a selector for the book prices
    let book_price_selector = Selector::parse(".price_color").unwrap();

    // Iterate over the book prices
    for book_price in document.select(&book_price_selector) {
        let price = book_price.text().collect::<Vec<_>>();
        println!("Price: {}", price[0]);
    }
}

Advantages of using Rust

  • Rust is an efficient language, comparable to C++. You can build heavy-duty games and software with it.
  • It can handle a high volume of concurrent connections, unlike Python.
  • Rust can interoperate with languages like C and Python.

Disadvantages of using Rust

  • Rust is a young language compared to Node.js and Python. Its community is still small, so it can be difficult for a beginner to find help with even a small error.
  • Rust's syntax is harder to pick up than Python's or Node.js's, so the code can be difficult to read and understand at first.

Conclusion

We learned how Rust can be used for web scraping. Using the same approach, you can scrape many other websites as well; for example, you can extend the above code to also scrape images and ratings. This will surely improve your web scraping skills with Rust.

I hope you liked this little tutorial, and if you did, please do not forget to share it with your friends and on social media.


Manthan Koolwal

My name is Manthan Koolwal and I am the founder of scrapingdog.com. I love creating scrapers and seamless data pipelines.