
Web Scraping with Ruby: Tips, Tricks, and Techniques

25-11-2022

Web scraping is the extraction of data from a web page's HTML. It is used to collect valuable data at scale. In this tutorial, we are going to use Ruby to scrape job listings from Blockwork.


Here is the web page we are going to scrape with the help of Ruby in this tutorial: https://blockwork.cc/listings.


Ruby Web Scraping

Web scraping is a technique used to extract data from websites. It can be used to extract information such as product details, prices, contact information, and more.

Scraping data from websites can be a tedious and time-consuming task, especially if the website is not well structured. This is where Ruby comes in handy. Ruby is a powerful programming language that makes it easy to process and extract data from websites.

With Ruby, you can use the Nokogiri and Open-URI gems to easily extract data from websites. Nokogiri is a Ruby gem that makes it easy to parse and search HTML documents. Open-URI is a Ruby gem that allows you to open and read files from the web.

In this article, we will learn how to scrape websites using the Ruby programming language. We will use the Nokogiri and HTTParty gems to make our lives easier. We will also look at how to scrape paginated data.

Ruby Scraper Library

Ruby has a few different libraries that can be used for web scraping, but one of the most popular is Nokogiri. Nokogiri is a Ruby gem that can be used to parse HTML and XML documents. It’s fast and easy to use, and it has a ton of features that make it perfect for web scraping.

One of the best things about Nokogiri is that it plays well with a variety of other Ruby gems, so you can usually find a way to get the data you need. Keep in mind that Nokogiri only parses HTML; it does not execute scripts. If you need to scrape a website that renders its content with JavaScript, you can pair Nokogiri with a JavaScript runtime such as the therubyracer gem, or drive a headless browser and hand the rendered HTML to Nokogiri.

Nokogiri is also very well-documented, so you can always find what you need if you get stuck. There’s a great community of Nokogiri users who are always willing to help, so you’ll never feel lost when you’re using this gem.

Let’s Go…

Setup

Our setup is pretty simple. Just type the commands below in cmd.

mkdir scraper
cd scraper
touch Gemfile
touch scraper.rb

Now, you can open this scraper folder in any of your favorite code editors. I will use Atom. Inside our scraper folder, we have our Gemfile and a scraper file.

For our scraper, we are going to use a couple of gems. So, the first thing I want to do is jump into the Gemfile we just created and add a few things. We are going to add three gems: HTTParty, Nokogiri, and Byebug.

source "https://rubygems.org"

gem "httparty"
gem "nokogiri"
gem "byebug"

Now, go back to your cmd and install all the gems using

bundle install

After this, everything is set and a file named Gemfile.lock has been created in our working folder. Our setup is complete.

Preparing the Food

Now we are going to start writing our scraper in the scraper.rb file. Before we start writing our scraper, I am going to require the dependencies that we just added to our Gemfile: nokogiri, httparty, and byebug.

require 'nokogiri'
require 'httparty'
require 'byebug'

I am going to create a new method and call it scraper and this is where all of our scraper functionality is going to live.

def scraper
   url = "https://blockwork.cc/listings"
   unparsed_page = HTTParty.get(url)
   parsed_page = Nokogiri::HTML(unparsed_page)
   byebug
end

scraper

We have declared a variable named url inside the method, and to make an HTTP GET request to this URL we use HTTParty.

After the HTTP call, we get the raw HTML source code of that web page. Next, we bring in Nokogiri to actually parse that page.

So let's create another variable called parsed_page. Nokogiri gives us a structure from which we can start to extract data out of the raw HTML. Then we use byebug, which drops us into a debugger that lets us interact with these variables. Once we have added that, we can jump back to our cmd.

ruby scraper.rb
parsed_page #on hitting byebug

On typing parsed_page after hitting byebug, we'll see the parsed HTML document.


Here, we can use Nokogiri to interact with this data, and this is where things get pretty cool. Using Nokogiri we can target various items on the page, like classes and IDs. We'll inspect the job page and find the class associated with each job block.


On inspection, we see that every job has a class “listingCard”.

In cmd type

jobCards = parsed_page.css('div.listingCard')

Now, if you type jobCards.first in the terminal, it will show the first job block. To extract the position, location, company, and the URL to apply, we can dig a little deeper using CSS selectors.

#Coming back to scraper.rb

def scraper
   url = "https://blockwork.cc/listings"
   unparsed_page = HTTParty.get(url)
   parsed_page = Nokogiri::HTML(unparsed_page)
   jobs = Array.new
   job_listings = parsed_page.css('div.listingCard')
   job_listings.each do |job_listing|
      job = {
             title: job_listing.css('span.job-title').text,
             company: job_listing.css('span.company').text,
             location: job_listing.css('span.location').text,
             url: "https://blockwork.cc" + job_listing.css('a')[0].attributes['href'].value
       }
       jobs << job
   end
   byebug
end

scraper

We have created a variable job_listings which contains the 50 job postings on the page. We then want to push that data into an array, so for each listing we build a job hash that holds the individual job's details.

Now, we can iterate over 50 jobs on a page and we should be able to extract the data that we are trying to target out of each of those jobs. A jobs array has been declared to store all 50 job listings one by one. Now, we can run our script on cmd to check all 50 listings.

ruby scraper.rb
jobs #after hitting byebug

We have managed to scrape the first page but what if we want to scrape all the pages?

Scraping Every Page

We have to make our scraper a little more intelligent. We are going to make a few tweaks to our web scraper. Here we will take pagination into account and we’ll scrape all the listings on this site instead of just 50 per page.

There are a couple of things we want to know in order to make this work. The first is basically just how many listings are getting served on each page. So, we already know that it’s 50 listings per page. The other thing we want to figure out is the total number of listings on the site. We already know that we have 2287 listings on the site.

#Coming back to scraper.rb

def scraper
   url = "https://blockwork.cc/listings"
   unparsed_page = HTTParty.get(url)
   parsed_page = Nokogiri::HTML(unparsed_page)
   jobs = Array.new
   job_listings = parsed_page.css('div.listingCard')

   page = 1

   per_page = job_listings.count #50
   total = parsed_page.css('div.job-count').text.split(' ')[1].gsub(',', '').to_i #2287
   last_page = (total.to_f / per_page.to_f).ceil

   while page <= last_page
     pagination_url = "https://blockwork.cc/listings?page=#{page}"

     pagination_unparsed_page = HTTParty.get(pagination_url)
     pagination_parsed_page = Nokogiri::HTML(pagination_unparsed_page)
     pagination_job_listings = pagination_parsed_page.css('div.listingCard')

     pagination_job_listings.each do |job_listing|
       job = {
             title: job_listing.css('span.job-title').text,
             company: job_listing.css('span.company').text,
             location: job_listing.css('span.location').text,
             url: "https://blockwork.cc" + job_listing.css('a')[0].attributes['href'].value
        }

        jobs << job
     end

     page += 1

   end
   byebug
end

scraper

per_page holds the number of job listings on a page, and total holds the total number of job postings; we calculate both from the page itself rather than hardcoding them. last_page determines the number of the last page. We have declared a while loop which stops once page exceeds last_page. pagination_url provides a new URL for every page value, and then the same logic follows as what we used while scraping the first page. The jobs array will contain every job present on the website.
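The page arithmetic can be checked in isolation with the numbers from the article (2287 listings, 50 per page). Note that rounding up guarantees a partial final page is still fetched:

```ruby
total    = 2287  # total listings on the site
per_page = 50    # listings served per page

# 2287 / 50 = 45.74 pages of results, so we need 46 requests.
last_page = (total.to_f / per_page).ceil
puts last_page               # => 46

# Why round up and not to the nearest integer? With, say, 2220 listings,
# (2220.0 / 50).round would give 44 and skip the final partial page,
# while .ceil correctly gives 45.
puts (2220.0 / 50).ceil      # => 45
```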


So, basically, just like that, we can build a simple and powerful web scraper using Ruby and Nokogiri.
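As a possible next step (not covered above), the jobs array could be written out to a CSV file using Ruby's standard library; the sample hash below is hypothetical, shaped like the ones built in scraper.rb:

```ruby
require 'csv'

# A hypothetical jobs array, shaped like the hashes built in scraper.rb.
jobs = [
  { title: "Ruby Developer", company: "Acme", location: "Remote",
    url: "https://blockwork.cc/listings/123" }
]

# Build the CSV in memory: one header row, then one row per job.
csv = CSV.generate do |rows|
  rows << %w[title company location url]
  jobs.each { |job| rows << job.values_at(:title, :company, :location, :url) }
end

File.write("jobs.csv", csv)
puts csv
```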

Conclusion

In this article, we learned how to scrape data using Ruby and Nokogiri. Once you start playing with it, you can do a lot with Ruby. Ruby makes it easy to modify existing code or add new features. It is a concise language that, when combined with third-party libraries, allows you to develop features incredibly fast, making it one of the most productive programming languages around.

Feel free to comment and ask me anything. You can follow me on Twitter and  Medium. Thanks for reading and please hit the like button! 👍

Frequently Asked Questions

Q: Is Ruby better than Python?

Ans: Both are well suited to web scraping. Ruby is concise, and Nokogiri makes HTML parsing pleasant, while Python has a larger scraping ecosystem (for example, BeautifulSoup and Scrapy). The better choice usually comes down to which language you and your team already know.

Additional Resources

At this point, you should feel comfortable writing your first web scraper to gather data from any website. Here are a few additional resources that you may find helpful during your web scraping journey:

Manthan Koolwal

My name is Manthan Koolwal and I am the CEO of scrapingdog.com. I love creating scrapers and seamless data pipelines.

Try Scrapingdog for Free!

1,000 free API calls for testing.

No credit card required!