
Python Web Scraping Authentication: Behind the OAuth Wall

24-09-2020

There are a number of different ways to authenticate when scraping data from websites. One popular method is using OAuth, which is an open standard for authorization.

In this post, we are going to learn how to handle authentication while web scraping with Python. We will scrape LinkedIn using a requests session. LinkedIn is a great source of public data for lead generation, sentiment analysis, jobs, and more. We will code a scraper that can fetch person profiles, jobs, company profiles, etc.

Web Scraping Behind Authentication with Python

Authenticating with Python

In the world of online security, authentication is key. Whether you’re logging in to a website or accessing an API, you need to be sure that your credentials are secure.

Python offers a number of ways to handle authentication, from basic username/password checking to more complex methods like OAuth.

Username/Password Authentication

The most basic form of authentication is the username/password combination. This is the most common method for logging into websites, and it can also be used to access APIs.

When using this method, the server needs to store credentials in a file or database. The safe way to do this is to hash the password with a library like bcrypt rather than storing it in plain text.

Once you have the hashed password, you can check it against the user's input when they try to log in. If the hash matches, the user is authenticated.
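
As a minimal sketch of this flow, assuming the bcrypt package is installed (pip install bcrypt):

import bcrypt

# Hash the password once, e.g. at registration time.
password = b"s3cret-password"
hashed = bcrypt.hashpw(password, bcrypt.gensalt())

# Later, verify a login attempt against the stored hash.
attempt = b"s3cret-password"
if bcrypt.checkpw(attempt, hashed):
    print("Authenticated")
else:
    print("Invalid credentials")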

OAuth

OAuth is a popular authentication protocol that allows users to approve third-party applications to access their data. This is the protocol that’s used by sites like Facebook and Twitter to allow users to log in with their existing accounts.

If you’re building an application that needs to access data from another service, then OAuth is a good option. It’s more secure than storing credentials in a file, and it’s less likely to cause problems if your application is compromised.

To use OAuth, you’ll need to register your application with the service that you’re trying to access. This will give you a set of keys that you can use to generate access tokens.

Once you have an access token, you can use it to make requests to the API on behalf of the user. This process is known as “OAuth authentication.”
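
As a minimal sketch, assuming you already obtained a token (the endpoint and token below are placeholders, not a real API):

import requests

ACCESS_TOKEN = "your-access-token"          # placeholder
API_URL = "https://api.example.com/v1/me"   # hypothetical endpoint

# OAuth 2.0 bearer tokens are sent in the Authorization header.
response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
print(response.status_code)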

SSL/TLS

When you’re authenticating with Python, it’s important to use a secure connection. The best way to do this is to use SSL/TLS.

SSL (Secure Sockets Layer) is a protocol that encrypts data while it's in transit. This means that your credentials will remain safe even if the traffic is intercepted by a third party.

TLS (Transport Layer Security) is the successor to SSL. It’s more secure, and it’s the protocol that’s used by most web services today.

To use SSL/TLS, you’ll need to generate a set of keys and certificates. Once you have these, you can configure your web server to use SSL/TLS.
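
From the scraping side, the requests library already verifies TLS certificates by default. A minimal sketch (the custom CA bundle path below is hypothetical):

import requests

# Certificate verification is on by default; this call fails on a bad cert.
requests.get("https://www.linkedin.com")

# If your environment uses a private CA, point requests at its bundle:
# requests.get("https://internal.example.com", verify="/path/to/ca-bundle.pem")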

Authentication is an important part of online security, and SSL/TLS is the best way to secure the connection itself. Whichever method you use, from basic username/password checks to OAuth, make sure your requests travel over an encrypted connection.

Python Web Scraping Authentication: Requirements

Generally, web scraping is divided into two parts:

  1. Fetching data by making an HTTP request
  2. Extracting important data by parsing the HTML DOM

Libraries & Tools

Beautiful Soup can't log in by itself; it only parses HTML. The login happens through the requests library, whose Session object stores our cookies, while Beautiful Soup extracts data from the pages we fetch.

  • Requests allows you to send HTTP requests very easily.
  • Beautiful Soup makes it easy to extract important data from the HTML DOM.

Setup

Our setup is pretty simple. Just create a folder and install Beautiful Soup & requests. To create the folder and install the libraries, type the commands given below. I am assuming that you have already installed Python 3.x.

mkdir scraper
pip install beautifulsoup4
pip install requests

Now, create a file inside that folder by any name you like. I am using scraping.py.

Firstly, you have to sign up for a LinkedIn account. Then just import Beautiful Soup & requests in your file, like this:

from bs4 import BeautifulSoup
import requests

We just want to get the HTML of a profile. For this tutorial, I will use Richard Branson's profile (https://www.linkedin.com/in/rbranson).

Session

We will use a Session object from the requests library to persist the user session. The session is later used to make the requests.

All cookies will then persist within the session for each forthcoming request. Meaning, if we sign in, the session will remember us and use that for all future requests we make.

client = requests.Session()
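
To see what the session buys us, here is a quick sketch using httpbin.org (a public echo service, used purely for illustration): a cookie set by one response is sent back automatically on the next request.

import requests

client = requests.Session()

# The first response sets a cookie via a redirect...
client.get("https://httpbin.org/cookies/set/sessionid/abc123")

# ...and the session replays it automatically on later requests.
print(client.get("https://httpbin.org/cookies").json())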

Preparing the Food

Now, we have all the ingredients in place to build a scraper. So, let’s start cooking. Let’s just open the developer tools, go to the Network tab and log in so we can catch the URL.

Developer Tools

email = "******@*****"
password = "******"
HOMEPAGE_URL = 'https://www.linkedin.com'
LOGIN_URL = 'https://www.linkedin.com/checkpoint/lg/login-submit'

Paste in your own email and password. The developer tools clearly show the login request going to https://www.linkedin.com/checkpoint/lg/login-submit, so let's save that as LOGIN_URL. This is where our first request will go.

You will notice from the developer tools that the login also requires a CSRF token. It takes other data too, but for this tutorial, we'll consider the CSRF token only.

CSRF Token

Now, the question is how to get that token. The answer to that is very straightforward. We will make an HTTP request to HOMEPAGE_URL and then we will use BeautifulSoup to extract the token out of it.

html = client.get(HOMEPAGE_URL).content
soup = BeautifulSoup(html, "html.parser")
csrf = soup.find('input', {'name': 'loginCsrfParam'}).get('value')

Now we have the CSRF token, and the only job left is to log in and scrape the profile.

Login

We will log in by making a POST request to LOGIN_URL.

login_information = {
    'session_key': email,
    'session_password': password,
    'loginCsrfParam': csrf,
}

client.post(LOGIN_URL, data=login_information)

Now you are basically done with the login part. You have made the request to sign in, and all other requests you make through the same session will be sent as the signed-in user.
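
If you want to verify that the login worked, capture the response from the POST and inspect it. This is a sketch; the exact status code and final URL depend on LinkedIn's current behavior.

response = client.post(LOGIN_URL, data=login_information)

# A 200 after redirects means the request completed; response.url shows
# where LinkedIn finally sent us (landing on a login page again usually means failure).
print(response.status_code, response.url)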

Scrape Profile

Richard Branson's (Founder at Virgin Group) LinkedIn Profile

s = client.get('https://www.linkedin.com/in/rbranson').text

print(s)

Parsing

Now, the actual scraping. This guide won’t cover that. But if you want, you can read my other guide on how to scrape with Beautiful Soup. It’s very easy to just pick up where you left off here.
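
As a tiny taste of what that parsing looks like, here is a sketch; LinkedIn's markup changes often, so treat the selector as illustrative only.

soup = BeautifulSoup(s, "html.parser")

# The <title> tag is one of the few stable elements on the page.
print(soup.title.get_text())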

Conclusion

In this article, we learned how to scrape data from behind a login using a requests session & BeautifulSoup, an approach that carries over to many other websites that require authentication.

Feel free to comment and ask me anything. You can follow me on Twitter and Medium. Thanks for reading and please hit the like button!

Additional Resources

At this point, you should feel comfortable writing your first web scraper to gather data from websites that sit behind a login. Here are a few additional resources that you may find helpful during your web scraping journey:

Manthan Koolwal

My name is Manthan Koolwal and I am the founder of scrapingdog.com. I love creating scrapers and seamless data pipelines.