Scrape data from LinkedIn using Python and save it in a CSV file
2020-06-13
In this post, we are going to scrape data from LinkedIn using Python and a web scraping tool. We will extract the Company Name, Website, Industry, Company Size, Number of Employees, Headquarters Address, and Specialties.
Why this tool? It helps us scrape dynamic websites through millions of rotating residential proxies so that we don’t get blocked, and it also provides a CAPTCHA-clearing facility.
Procedure
Generally, web scraping is divided into two parts:
Fetching data by making an HTTP request
Extracting important data by parsing the HTML DOM
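In plain Requests and Beautiful Soup terms, those two steps look roughly like this (example.com is just a stand-in for any target page):
import requests
from bs4 import BeautifulSoup

# 1. Fetch: make an HTTP request for the raw HTML.
html = requests.get("https://example.com").text

# 2. Extract: parse the HTML DOM and pull out the piece we want.
soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text())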
Libraries & Tools
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Requests allows you to send HTTP requests very easily.
Pandas provides fast, flexible, and expressive data structures.
Web Scraper to extract the HTML code of the target URL.
Setup
Our setup is pretty simple. Just create a folder and install Beautiful Soup and Requests. To create the folder and install the libraries, run the commands below. I am assuming that you have already installed Python 3.x.
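The commands will look something like this (the folder name is just an example, and I install Pandas as well since we use it later to write the CSV):
mkdir linkedin-scraper
cd linkedin-scraper
pip install beautifulsoup4 requests pandas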
Now, create a file inside that folder with any name you like. I am using scraping.py.
First, you have to sign up for Web Scraper. It will provide you with 1000 FREE credits. Then just import Beautiful Soup, Requests, and Pandas in your file, like this:
from bs4 import BeautifulSoup
import requests
import pandas as pd
What we are going to scrape
We are going to scrape the “about” page of Google from LinkedIn.
Preparing the Food
Now, since we have all the ingredients to prepare the scraper, we should make a GET request to the target URL to get the raw HTML data. If you are not familiar with the scraping tool, I would urge you to go through its documentation. We will use Requests to make the HTTP GET request. Since we are scraping a company page, I have set “type” to company and “linkId” to google/about/. The linkId can be found in LinkedIn’s target URL.
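A minimal sketch of that request, assuming the tool exposes an HTTP endpoint that accepts api_key, type, and linkId parameters; the endpoint URL below is a placeholder, so take the real one (and your key) from the tool’s documentation:
# Placeholder endpoint -- substitute the real API URL and your key from the tool's dashboard.
API_ENDPOINT = "https://api.your-scraping-tool.com/linkedin"

params = {
    "api_key": "YOUR_API_KEY",    # the key you receive after signing up
    "type": "company",            # we are scraping a company page
    "linkId": "google/about/",    # taken from linkedin.com/company/google/about/
}

resp = requests.get(API_ENDPOINT, params=params)
html = resp.text                  # raw HTML of the target LinkedIn page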
Now, we will focus on extracting the Website, Industry, Company Size, Headquarters (Address), Type, and Specialties.
All of the above properties (except Company Size) are stored in dd tags with the class “org-page-details__definition-text t-14 t-black--light t-normal”. I will again use the variable soup to extract all the properties.
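Here is a sketch of that extraction plus the CSV step from the title, assuming the html variable from the request above and the class name quoted here (LinkedIn changes its markup often, and the field order below is an assumption you should verify against the page you actually get back):
soup = BeautifulSoup(html, "html.parser")

# Every property except Company Size sits in a <dd> tag with the class named above.
dds = soup.find_all(
    "dd",
    class_="org-page-details__definition-text t-14 t-black--light t-normal",
)
values = [dd.get_text(strip=True) for dd in dds]

# The order of the <dd> tags on the page determines this mapping;
# double-check it against the HTML you actually receive.
fields = ["Website", "Industry", "Headquarters", "Type", "Specialties"]
company = dict(zip(fields, values))

# Finally, save the result to a CSV file with pandas, as promised in the title.
pd.DataFrame([company]).to_csv("company.csv", index=False)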
In this article, we saw how we can scrape data from LinkedIn using a proxy scraper and Python. As I said earlier, you can scrape a profile too, but read the docs before trying it.
Feel free to comment and ask me anything. You can follow me on Twitter and Medium. Thanks for reading and please hit the like button! 👍