Web Scraping with Ruby

Web Scraping
Web scraping is the extraction of data from the HTML of a web page. It is used to collect valuable data that would be tedious to copy by hand. We are going to scrape Ruby on Rails job listings from Blockwork, at https://blockwork.cc/.

Setup
Our setup is pretty simple. Just run the commands below in your terminal.
mkdir scraper
cd scraper
touch Gemfile
touch scraper.rb
Now you can open the scraper folder in your favorite code editor. I will use Atom. Inside the folder we have our Gemfile (Bundler looks for that exact name) and a scraper file. Our scraper uses a few gems, so the first thing to do is open the Gemfile we just created and add them. We are going to add three gems: HTTParty, Nokogiri, and Byebug.
source "https://rubygems.org"

gem "httparty"
gem "nokogiri"
gem "byebug"
Now go back to your terminal and install all the gems using
bundle install
After this, everything is set and a file named Gemfile.lock has been created in our working folder. Our setup is complete.
Preparing the Food
Now we are going to start writing our scraper in the scraper.rb file. Before writing any scraping logic, we require the dependencies that we just added to our Gemfile: nokogiri, httparty, and byebug.
require 'nokogiri'
require 'httparty'
require 'byebug'
I am going to create a new method and call it scraper and this is where all of our scraper functionality is going to live.
def scraper
  url = "https://blockwork.cc/"
  unparsed_page = HTTParty.get(url)
  # Hand Nokogiri the response body, i.e. the raw HTML string
  parsed_page = Nokogiri::HTML(unparsed_page.body)
  byebug
end

scraper
We have declared a variable named url inside the method, and we use HTTParty to make an HTTP GET request to it. The response contains the raw HTML source of the web page. Next we bring in Nokogiri to parse that HTML and store the result in another variable called parsed_page. Nokogiri gives us a document object from which we can start extracting data. Finally, we call byebug, which pauses the script and opens a debugger prompt where we can interact with these variables. Once we have added that, we can jump back to the terminal.
ruby scraper.rb
parsed_page # at the byebug prompt
Typing parsed_page at the byebug prompt prints the parsed Nokogiri document for the page.

Here, we can use Nokogiri to interact with this data, and this is where things get pretty cool. Nokogiri lets us target elements on the page by class, ID, and other CSS selectors. We’ll inspect the job page in the browser and find the class associated with each job block.

On inspection, we see that every job sits in a div with the class “listingCard”.
At the byebug prompt, type
job_cards = parsed_page.css('div.listingCard')

Now, if you type job_cards.first at the prompt, it will show the result for the first job block. To extract the position, location, company, and the URL to apply, we can dig a little deeper with CSS selectors.

# Coming back to scraper.rb
def scraper
  url = "https://blockwork.cc/"
  unparsed_page = HTTParty.get(url)
  parsed_page = Nokogiri::HTML(unparsed_page.body)
  jobs = Array.new
  job_listings = parsed_page.css("div.listingCard")
  job_listings.each do |job_listing|
    # .text pulls the string out of the matched nodes
    job = {
      title: job_listing.css('span.job-title').text,
      company: job_listing.css('span.company').text,
      location: job_listing.css('span.location').text,
      url: "https://blockwork.cc" + job_listing.css('a')[0].attributes['href'].value
    }
    jobs << job
  end
  byebug
end

scraper
The job_listings variable holds every job card on the page, 50 in this case. We iterate over those cards and, for each one, build a job hash that holds the title, company, location, and URL of that posting. Each hash is appended to the jobs array, so after the loop the array contains all 50 listings. Now we can run our script from the terminal and check them.
ruby scraper.rb
jobs #After hitting the byebug
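The url field in the job hash is built by gluing the site root onto the href, which assumes every link is a root-relative path. One possible alternative, sketched here with Ruby's standard URI library, resolves the href against the base URL so that absolute links pass through unchanged. The absolute_url helper name and the sample paths are assumptions for illustration, not something taken from the site.

```ruby
require 'uri'

BASE_URL = "https://blockwork.cc/"

# Hypothetical helper: resolves an href against the site root.
# Relative paths are joined to the base; absolute URLs are kept as-is.
def absolute_url(href, base = BASE_URL)
  URI.join(base, href).to_s
end

puts absolute_url("/listings/1")               # https://blockwork.cc/listings/1
puts absolute_url("https://example.com/apply") # https://example.com/apply
```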

We have managed to scrape the first page but what if we want to scrape all the pages?
Scraping Every Page
We have to make our scraper a little more intelligent, so we are going to make a few tweaks. Here we take pagination into account and scrape all the listings on the site instead of just the 50 on the first page. There are a couple of things we need to know to make this work. The first is how many listings are served on each page, which we already know is 50. The second is the total number of listings on the site, which we already know is 2287.
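The page arithmetic is easy to check in isolation. The sketch below assumes the total appears in a label such as "Showing 2,287 jobs" (the exact wording on the site is an assumption, as are the parse_total and page_count helper names) and shows why the page count should be rounded up rather than rounded to the nearest integer.

```ruby
# Pulls the first number out of a label, ignoring thousands separators.
# The label format is an assumption; adjust the pattern to the real text.
def parse_total(label)
  label[/\d[\d,]*/].to_s.delete(',').to_i
end

# Pages needed to show `total` items at `per_page` items each.
# Integer ceiling division avoids floating point entirely.
def page_count(total, per_page)
  (total + per_page - 1) / per_page
end

total = parse_total("Showing 2,287 jobs")
puts total                    # 2287
puts page_count(total, 50)    # 46

# With 2310 listings, rounding to nearest would miss the last 10 jobs:
puts((2310 / 50.0).round)     # 46
puts page_count(2310, 50)     # 47
```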
# Coming back to scraper.rb
def scraper
  url = "https://blockwork.cc/"
  unparsed_page = HTTParty.get(url)
  parsed_page = Nokogiri::HTML(unparsed_page.body)
  jobs = Array.new
  job_listings = parsed_page.css("div.listingCard")
  page = 1
  per_page = job_listings.count # 50
  total = parsed_page.css('div.job-count').text.split(' ')[1].gsub(',', '').to_i # 2287
  # Round up so a partially filled last page is still visited
  last_page = (total.to_f / per_page.to_f).ceil
  while page <= last_page
    pagination_url = "https://blockwork.cc/listings?page=#{page}"
    pagination_unparsed_page = HTTParty.get(pagination_url)
    pagination_parsed_page = Nokogiri::HTML(pagination_unparsed_page.body)
    pagination_job_listings = pagination_parsed_page.css("div.listingCard")
    pagination_job_listings.each do |job_listing|
      job = {
        title: job_listing.css('span.job-title').text,
        company: job_listing.css('span.company').text,
        location: job_listing.css('span.location').text,
        url: "https://blockwork.cc" + job_listing.css('a')[0].attributes['href'].value
      }
      jobs << job
    end
    page += 1
  end
  byebug
end

scraper
per_page counts the job listings on a page, and total reads the total number of job postings from the page, so neither value is hardcoded. last_page is the number of pages to visit; it is rounded up with ceil so that a partially filled final page is not skipped. The while loop runs as long as page is less than or equal to last_page, so it stops once the final page has been processed. pagination_url builds a new URL for every page value. Then the same logic is followed as when scraping the first page. When the loop finishes, the jobs array contains all the jobs present on the website.
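The control flow of this loop can be tried offline by swapping the network call for a stub. In the sketch below, FAKE_SITE and fetch_page are hypothetical stand-ins for the HTTParty and Nokogiri steps, and collect_jobs is an assumed helper name; only the while loop mirrors the scraper.

```ruby
# Hypothetical stand-in for the site: page number => job hashes.
FAKE_SITE = {
  1 => [{ title: "Rails Developer" }, { title: "Ruby Engineer" }],
  2 => [{ title: "Backend Developer" }]
}.freeze

# In the real scraper this would fetch and parse one page of listings.
# Pages that do not exist yield an empty list.
def fetch_page(page)
  FAKE_SITE.fetch(page, [])
end

# Collects listings page by page until last_page has been processed.
def collect_jobs(last_page)
  jobs = []
  page = 1
  while page <= last_page
    jobs.concat(fetch_page(page))
    page += 1
  end
  jobs
end

puts collect_jobs(2).count # 3
```

Keeping the fetching step behind a small method like this also makes it simple to add a pause between requests or a retry later without touching the loop.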

So, basically, just like that, we can build a simple and powerful web scraper using Ruby and Nokogiri.
Conclusion
In this article, we saw how to scrape data using Ruby and Nokogiri. Once you start playing with it, you can do a lot with Ruby. Ruby on Rails makes it easy to modify existing code or add new features. Ruby is a concise language that, combined with third-party libraries, lets you develop features quickly. It is a very productive language to work in.
Additional Resources
And there’s the list! At this point, you should feel comfortable writing your first web scraper to gather data from any website. Here are a few additional resources that you may find helpful during your web scraping journey: