
Web Scraping with Go

09-01-2022

In this post, we are going to learn web scraping with Golang. But before that, we need to understand what web scraping is.

What is Web Scraping?

The web is composed of web pages and websites that contain data of all sorts. Pulling this data out of a web page or a website is referred to as web scraping. The process is automated using specially written programs called web scrapers, which are designed to extract data in large amounts. With web scrapers, you only have to tell them what data you are interested in, then step back and let them do the work for you.

If you need to scrape data automatically from multiple websites across the web, you need to alter your program so that it moves around the web and checks websites for the desired data. If the data meets your requirements, it is scraped; otherwise it is skipped and the process repeats. Such programs are called web crawlers.

Now that we have some idea about the topic, let’s build a web scraper.

Web Scraping with Go (Step by Step)

We start by creating a file named ghostscraper.go. We are going to use the Colly framework for this. It is a very well-written framework, and I highly recommend reading its documentation. To install it, copy the single-line command below and run it in your terminal or command prompt. It takes a little while to install.

go get -u github.com/gocolly/colly/...

Now, switch back to your file. We begin with specifying the package and then we can write our main function.

package main

func main() {
}

Try to run this code just to verify everything is ok.

Now, the first thing we need inside the function is a filename.

package main

func main() {
    fName := "data.csv"
}

Now that we have a file name, we can create a file.

package main

func main() {
    fName := "data.csv"
    file, err := os.Create(fName)
}

This will create a file named data.csv. Now that we have created the file, we need to check for any errors.

If there were any errors during the process, this is how you can catch them.

package main

func main() {
    fName := "data.csv"
    file, err := os.Create(fName)
    if err != nil {
        log.Fatalf("could not create the file, err: %q", err)
        return
    }
}

log.Fatalf() prints the message and then exits the program.
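To see that behavior in isolation, here is a minimal, standalone sketch (the error value is made up purely for illustration):

package main

import (
    "errors"
    "log"
)

func main() {
    err := errors.New("disk full") // made-up error for the demo
    // Fatalf formats the message like Printf and then calls os.Exit(1),
    // so nothing after this line runs (deferred calls are skipped too).
    log.Fatalf("could not create the file, err: %q", err)
    log.Println("this line is never reached")
}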

The last thing you do with a file is close it.

package main

func main() {
    fName := "data.csv"
    file, err := os.Create(fName)
    if err != nil {
        log.Fatalf("could not create the file, err: %q", err)
        return
    }
    defer file.Close()
}

Now, here defer is very helpful. Anything you prefix with defer is executed when the surrounding function returns, not right away. So once we are done working with the file, Go will close it for us. Isn’t that amazing? We don’t have to worry about closing the file manually.
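Here is a tiny sketch of that behavior, showing that a deferred call only runs when the function is about to return:

package main

import "fmt"

func main() {
    // Deferred calls run when main returns, in last-in-first-out order.
    defer fmt.Println("closing resources (runs last)")
    fmt.Println("doing the actual work (runs first)")
}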

Alright, so we have our file ready, and as we hit save, the Go tooling in your editor will add a few things to the code.

package main

import (
    "log"
    "os"
)

func main() {
    fName := "data.csv"
    file, err := os.Create(fName)
    if err != nil {
        log.Fatalf("could not create the file, err: %q", err)
        return
    }
    defer file.Close()
}

Go imported the necessary packages for us. That was really helpful.

The next thing we need is a CSV writer. Whatever data we are fetching from the website, we are going to write it into a CSV file. For that, we need to have a writer.

package main

import (
    "encoding/csv"
    "log"
    "os"
)

func main() {
    fName := "data.csv"
    file, err := os.Create(fName)
    if err != nil {
        log.Fatalf("could not create the file, err: %q", err)
        return
    }
    defer file.Close()

    writer := csv.NewWriter(file)
}

After adding the writer and saving, Go will import another package for us: encoding/csv.

Once we are done writing, we also need to flush the writer, which pushes everything still sitting in its buffer out to the file. For that, we will use Flush.

package main

import (
    "encoding/csv"
    "log"
    "os"
)

func main() {
    fName := "data.csv"
    file, err := os.Create(fName)
    if err != nil {
        log.Fatalf("could not create the file, err: %q", err)
        return
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()
}

But again, this has to happen at the end, not right away, so we add the defer keyword here as well.
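Here is a small, self-contained sketch of the Write/Flush relationship; the file name and rows are made up for illustration:

package main

import (
    "encoding/csv"
    "log"
    "os"
)

func main() {
    file, err := os.Create("example.csv")
    if err != nil {
        log.Fatalf("could not create the file, err: %q", err)
    }
    defer file.Close()

    // csv.Writer buffers records in memory; Flush writes whatever is
    // still buffered out to the underlying file. Deferred calls run in
    // last-in-first-out order, so Flush runs before Close.
    writer := csv.NewWriter(file)
    defer writer.Flush()

    // Each Write call appends one CSV record (one row).
    writer.Write([]string{"title", "stipend"})
    writer.Write([]string{"Go Developer Intern (example)", "10000"})
}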

So now we have our file structure and a writer ready. Now we can get our hands dirty with web scraping.

We will start by instantiating a collector.

package main

import (
    "encoding/csv"
    "log"
    "os"

    "github.com/gocolly/colly"
)

func main() {
    fName := "data.csv"
    file, err := os.Create(fName)
    if err != nil {
        log.Fatalf("could not create the file, err: %q", err)
        return
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()

    c := colly.NewCollector(
        colly.AllowedDomains("internshala.com"),
    )
}

Go has also imported colly for us. We have also specified which domain we are working with: we will scrape Internshala (a platform where companies post internships).
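For reference, NewCollector accepts other options besides AllowedDomains. Here is a hedged sketch with a couple of them; the user-agent string and depth value are just illustrative, and only AllowedDomains is used in this tutorial:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("internshala.com"), // restrict visits to this domain
        colly.UserAgent("ghostscraper/1.0"),     // example user-agent string
        colly.MaxDepth(1),                       // do not follow links past the start page
    )
    fmt.Println(c.UserAgent)
}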

The next thing we need to do is point to the web page we will fetch the data from. Here is how we are going to do that. We will fetch internships from this page.

We are interested in the internships listed there, and we will scrape every individual internship. If you inspect the page, you will find that internship_meta is our target class.
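Before wiring this into the CSV writer, here is a short hedged sketch you can run to confirm that the selector matches something; it just prints the link text from the first listing page:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(colly.AllowedDomains("internshala.com"))

    // Print the link text of every block with the internship_meta class.
    c.OnHTML(".internship_meta", func(e *colly.HTMLElement) {
        fmt.Println(e.ChildText("a"))
    })

    c.Visit("https://internshala.com/internships/page-1")
}

With the selector confirmed, we can plug it into the scraper: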

package main

import (
    "encoding/csv"
    "log"
    "os"

    "github.com/gocolly/colly"
)

func main() {
    fName := "data.csv"
    file, err := os.Create(fName)
    if err != nil {
        log.Fatalf("could not create the file, err: %q", err)
        return
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()

    c := colly.NewCollector(
        colly.AllowedDomains("internshala.com"),
    )

    c.OnHTML(".internship_meta", func(e *colly.HTMLElement) {
        writer.Write([]string{
            e.ChildText("a"),
        })
    })
}

We have registered a callback on that HTML element; it fires for every element matching the internship_meta class, and inside it we write the data into our CSV file. writer.Write takes a slice of strings, and each element of the slice becomes one column of a CSV record, so we need to specify precisely what we want. ChildText returns the concatenated and stripped text of the matching child elements; here we pass the a tag to extract the text of all the links. (The trailing comma is simply Go’s syntax for multi-line slice literals.) We will also want the ChildText of the span tag to get the stipend amount a company is offering.
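Since the paragraph above mentions the span text as well, here is a hedged sketch of the same callback extended with a second column; whether that span actually holds the stipend depends on Internshala’s markup at the time you run it:

package main

import (
    "encoding/csv"
    "log"
    "os"

    "github.com/gocolly/colly"
)

func main() {
    file, err := os.Create("data.csv")
    if err != nil {
        log.Fatalf("could not create the file, err: %q", err)
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()

    c := colly.NewCollector(colly.AllowedDomains("internshala.com"))

    // Each ChildText result becomes its own column in the CSV row.
    c.OnHTML(".internship_meta", func(e *colly.HTMLElement) {
        writer.Write([]string{
            e.ChildText("a"),    // link text: internship title, company, etc.
            e.ChildText("span"), // span text: stipend and similar details
        })
    })

    c.Visit("https://internshala.com/internships/page-1")
}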

So what we have basically done is: first we created a collector from colly, and then we pointed it at the page structure and specified what we need from the web page.

So the next thing is to visit the website and fetch all the data, and we have to do this for every page. You can find the total number of pages at the bottom of the listing page; at the time of writing there are around 330. We will use the classic for loop here.

package main

import (
    "encoding/csv"
    "fmt"
    "log"
    "os"
    "strconv"

    "github.com/gocolly/colly"
)

func main() {
    fName := "data.csv"
    file, err := os.Create(fName)
    if err != nil {
        log.Fatalf("could not create the file, err: %q", err)
        return
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()

    c := colly.NewCollector(
        colly.AllowedDomains("internshala.com"),
    )

    c.OnHTML(".internship_meta", func(e *colly.HTMLElement) {
        writer.Write([]string{
            e.ChildText("a"),
        })
    })

    // The listing pages are numbered starting at 1.
    for i := 1; i <= 330; i++ {
        fmt.Printf("Scraping Page : %d\n", i)
        c.Visit("https://internshala.com/internships/page-" + strconv.Itoa(i))
    }

    log.Printf("Scraping Complete\n")
    log.Println(c)
}

First, we use a print statement to tell us which page is being scraped. Then the script visits the target page: since there are 330 pages, we insert the value of i, converted to a string, into the target URL. Finally, once scraping is complete, we log a summary of the collector.
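If you want a bit more visibility and error handling, here is a hedged sketch that checks the error c.Visit returns and uses colly’s OnRequest callback to announce each request (trimmed to two pages for brevity):

package main

import (
    "fmt"
    "log"
    "strconv"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(colly.AllowedDomains("internshala.com"))

    // OnRequest fires before each request, which is another way to
    // announce the page currently being scraped.
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
    })

    // Visit returns an error, so we can log failed pages instead of
    // silently ignoring them.
    for i := 1; i <= 2; i++ {
        if err := c.Visit("https://internshala.com/internships/page-" + strconv.Itoa(i)); err != nil {
            log.Printf("could not scrape page %d: %v", i, err)
        }
    }
}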

Let’s build it then. You just have to type go build in the terminal.

go build

It does nothing visible except create an executable file, goscraper, which we can now run.

The next command, to execute the file, is ./goscraper (you can press Tab to auto-complete the name).

./goscraper

It will start scraping the pages.

I stopped the scraper partway because I don’t want to scrape all the pages. Now open the data.csv file that the scraper created for us and have a look at the results.

That is it, your basic Go scraper is ready. If you want to make the output more readable, clean it up with regular expressions. I have also created a graph of the number of jobs versus job sectors; I leave that activity to you as homework.
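As a starting point for that cleanup, here is a small hedged sketch using Go’s regexp package; the sample string is made up, and the real text depends on Internshala’s markup when you run the scraper:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Made-up sample of what one scraped CSV cell might look like.
    raw := "Web Development Internship   Work From Home   ₹ 10,000 /month   1 Month"

    // Collapse runs of whitespace into single spaces to make the row readable.
    spaces := regexp.MustCompile(`\s+`)
    clean := spaces.ReplaceAllString(raw, " ")

    // Pull out the first number that looks like a stipend figure.
    stipend := regexp.MustCompile(`[\d,]+`).FindString(clean)

    fmt.Println(clean)
    fmt.Println("stipend:", stipend)
}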

Conclusion

In this tutorial, we used Colly, an open-source Golang library, to scrape a website. If you followed along, you were able to create a basic scraper that crawls a page or two. While this was an introductory article, we covered most of the methods you will need. You can build on this knowledge and create more complex web scrapers that crawl thousands of pages.

Feel free to message us to inquire about anything you need clarification on.

