
Scraping Data behind Authentication with Python

2020-09-24

 

Photo by Franki Chamaki on Unsplash

In this post, we are going to learn how to scrape data behind authentication with Python. Using a requests session, we will log in to LinkedIn and fetch pages as an authenticated user. LinkedIn is a great source of public data for lead generation, sentiment analysis, job listings, and more. We will code a scraper for it, and with that scraper you will be able to fetch person profiles, jobs, company profiles, etc.

Requirements

Generally, web scraping is divided into two parts:

- Fetching data by making an HTTP request
- Extracting the required data by parsing the HTML

Libraries & Tools

- Requests, to make HTTP requests and manage the login session
- Beautiful Soup, to parse the HTML and extract data from it

Setup

Our setup is pretty simple: just create a folder and install Beautiful Soup and requests. To create the folder and install the libraries, type the commands given below. I am assuming that you have already installed Python 3.x.

mkdir scraper
pip install beautifulsoup4
pip install requests
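Optionally, you can confirm that both libraries installed correctly with a quick import check (this check is my addition, not part of the original setup):

```shell
python -c "import bs4, requests; print(bs4.__version__, requests.__version__)"
```

If this prints two version numbers without an error, you are good to go.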

Now, create a file inside that folder by any name you like. I am using scraping.py.

First, you have to sign up for a LinkedIn account. Then just import Beautiful Soup and requests in your file, like this:

from bs4 import BeautifulSoup
import requests

We just want to get the HTML of a profile. For this tutorial, I will use Richard Branson's public profile (https://www.linkedin.com/in/rbranson).

Session

We will use a Session object from the requests library to persist the user session. The session is then used to make all subsequent requests.

All cookies will then persist within the session for each forthcoming request. Meaning, if we sign in, the session will remember us and use that for all future requests we make.

client = requests.Session()
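One detail worth knowing: anything you attach to the session — cookies or headers — is sent with every request made through it. For example, many sites block the default python-requests User-Agent, so it is common to set a browser-like one on the session once (the header string below is just an illustrative example, not something the original post prescribes):

```python
import requests

client = requests.Session()

# Headers set on the session are sent with every request made through it.
client.headers.update({"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)"})

# Cookies stored on the session behave the same way: set once, sent on
# every subsequent request until the session is closed.
client.cookies.set("demo", "1")
print(client.headers["User-Agent"])
```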

Preparing the Food

Now, we have all the ingredients in place to build a scraper. So, let’s start cooking. Let’s just open the developer tools, go to the Network tab and log in so we can catch the URL.

Developer Tools
email = "******@*****"
password = "******"
HOMEPAGE_URL = 'https://www.linkedin.com'
LOGIN_URL = 'https://www.linkedin.com/checkpoint/lg/login-submit'

Paste in your own email and password. The developer tools clearly show the login request going to https://www.linkedin.com/checkpoint/lg/login-submit, so that is what we save as LOGIN_URL. This is where our first request will go.

You will notice from the developer tools that the login also requires a CSRF token. It sends other form fields too, but for this tutorial we'll consider only the CSRF token.

CSRF Token

Now, the question is how to get that token. The answer to that is very straightforward. We will make an HTTP request to HOMEPAGE_URL and then we will use BeautifulSoup to extract the token out of it.

html = client.get(HOMEPAGE_URL).content
soup = BeautifulSoup(html, "html.parser")
csrf = soup.find('input', {'name': 'loginCsrfParam'}).get('value')
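As a quick sanity check of what that find call does, here is the same extraction run on a minimal, made-up HTML snippet (the token value is obviously fake):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the LinkedIn homepage: a form with a hidden CSRF input.
html = """
<form action="/checkpoint/lg/login-submit" method="post">
  <input type="hidden" name="loginCsrfParam" value="abc123">
</form>
"""

soup = BeautifulSoup(html, "html.parser")
# find() returns the first matching tag; .get('value') reads its attribute.
csrf = soup.find('input', {'name': 'loginCsrfParam'}).get('value')
print(csrf)  # → abc123
```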

Now we have the CSRF token. The only jobs left are to log in and scrape the profile.

Login

We will log in by making a POST request to LOGIN_URL:

login_information = {
    'session_key': email,
    'session_password': password,
    'loginCsrfParam': csrf,
}

client.post(LOGIN_URL, data=login_information)

Now you are basically done with the login part. You have made the request to sign in, and all other requests made through the same session will be treated as signed in.

Scrape Profile

s = client.get('https://www.linkedin.com/in/rbranson').text
print(s)

Parsing

Now, the actual scraping. This guide won’t cover that. But if you want, you can read my other guide on how to scrape with Beautiful Soup. It’s very easy to just pick up where you left off here.
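Still, to give a taste of that next step, here is a minimal sketch of parsing fetched HTML with Beautiful Soup. The markup and class names below are invented for illustration; LinkedIn's real markup is different and changes frequently, so you would need to inspect the page and adjust the selectors:

```python
from bs4 import BeautifulSoup

# Invented, simplified profile-like HTML; a real page needs its own selectors.
html = """
<main>
  <h1 class="top-card__name">Richard Branson</h1>
  <div class="top-card__headline">Founder at Virgin Group</div>
</main>
"""

soup = BeautifulSoup(html, "html.parser")
name = soup.find('h1', class_='top-card__name').get_text(strip=True)
headline = soup.find('div', class_='top-card__headline').get_text(strip=True)
print(name, '|', headline)  # → Richard Branson | Founder at Virgin Group
```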

Conclusion

In this article, we saw how to scrape data behind a login using a requests session and Beautiful Soup. The same pattern works for many websites that use form-based authentication.

Feel free to comment and ask me anything. You can follow me on Twitter and Medium. Thanks for reading and please hit the like button! 👍

