
A Practical Guide to Web Scraping for Machine Learning and Data Pre-processing

30-03-2023

We will walk through the data preprocessing steps that form the base of any machine learning model.

I consider this part the most boring, but we’ll learn many different machine-learning concepts while performing data preprocessing. In this article, we’ll take a somewhat professional route of collecting data. We’ll use web scraping to collect data from a website and store it in a CSV file.

So, it will be a combination of web scraping and data preprocessing. It will be a great exercise for us.

Web scraping using machine learning can extract data from websites automatically and efficiently. This technique can be used to scrape data from web pages that are constantly changing, such as search engine results pages or social media feeds.

Web Scraping For Machine Learning

Machine learning can automatically identify and extract the desired data from a website, making web scraping more efficient and accurate.

7 Steps of Data Preprocessing for Machine Learning

  1. Gathering the Data
  2. Import the data and libraries
  3. Divide the dataset into dependent & independent variables
  4. Check for missing values
  5. Check for categorical values
  6. Split the dataset into training and test set
  7. Feature Scaling

Gathering the Data

As I said earlier, we are going to collect data through web scraping, from the web page whose URL appears in the code below. I understand that we could have directly downloaded the CSV file and then continued to our second step, but I like to take the long view and think about how the skills learned here will help in the future.

Procedure

Generally, web scraping is divided into two parts:

  1. Fetching data by making an HTTP request
  2. Extracting important data by parsing the HTML DOM

Libraries & Tools

  1. Beautiful Soup is a Python library for pulling data out of HTML and XML files.
  2. Requests allows you to send HTTP requests very easily.
  3. Pandas provides fast, flexible, and expressive data structures.

Setup

Our setup is pretty simple. Just create a folder and install Beautiful Soup, requests, and pandas. To create the folder and install the libraries, run the commands below. I am assuming that you have already installed Python 3.x.

mkdir scraper
pip install beautifulsoup4
pip install requests
pip install pandas

Now, create a file inside that folder with any name you like. I am using dataprocess.py. Then just import Beautiful Soup, requests, and pandas as shown below.

from bs4 import BeautifulSoup
import requests
import pandas as pd

Preparing the Food

We are going to scrape the table from the page below, and then we are going to store the data in a CSV file using pandas.

r = requests.get('https://milindjagre.co/2018/03/10/post-3-ml-data-preprocessing-part-1/').text

soup = BeautifulSoup(r, 'html.parser')

u = list()
l = {}

We are going to scrape this table using BeautifulSoup.

table = soup.find("table", {"class": "js-csv-data csv-data js-file-line-container"})

tr = table.find_all("tr", {"class": "js-file-line"})

We’ll run a for loop over every row to reach each and every “td” tag.

for i in range(0, len(tr)):
    td = tr[i].find_all("td")

    # td[0] appears to hold the row number on this page, so the data starts at td[1]
    try:
        l["Country"] = td[1].text
    except:
        l["Country"] = None

    try:
        l["Age"] = td[2].text
    except:
        l["Age"] = None

    try:
        l["Salary"] = td[3].text
    except:
        l["Salary"] = None

    try:
        l["Purchased"] = td[4].text
    except:
        l["Purchased"] = None

    u.append(l)
    l = {}

The list u now holds one dictionary per table row and looks something like this (the two missing cells come through as None or as empty strings, depending on how the page renders them):

[
 {"Country": "France", "Age": "44", "Salary": "72000", "Purchased": "No"},
 {"Country": "Spain", "Age": "27", "Salary": "48000", "Purchased": "Yes"},
 {"Country": "Germany", "Age": "30", "Salary": "54000", "Purchased": "No"},
 {"Country": "Spain", "Age": "38", "Salary": "61000", "Purchased": "No"},
 {"Country": "Germany", "Age": "40", "Salary": None, "Purchased": "Yes"},
 {"Country": "France", "Age": "35", "Salary": "58000", "Purchased": "Yes"},
 {"Country": "Spain", "Age": None, "Salary": "52000", "Purchased": "No"},
 {"Country": "France", "Age": "48", "Salary": "79000", "Purchased": "Yes"},
 {"Country": "Germany", "Age": "50", "Salary": "83000", "Purchased": "No"},
 {"Country": "France", "Age": "37", "Salary": "67000", "Purchased": "Yes"}
]

From here, we use pandas to create a data frame from the above list and save the data into a CSV file.

df = pd.json_normalize(u)  # pd.io.json.json_normalize is deprecated in newer pandas
df.to_csv('data.csv', index=False, encoding='utf-8')

This will save the data to a CSV file.

So, here we finish our data-gathering process.

Import the data and libraries

Libraries are tools that you can use to do a specific job, and they make programming much simpler: you provide the input, and the library responds with the result you expect. We are going to use three essential libraries in data preprocessing (imported just after this list):

  1. Numpy is the fundamental package for array computing with Python.
  2. Matplotlib.pyplot is a plotting package for python
  3. Pandas is a powerful data structure for data analysis, time series, and statistics. It is also very helpful in importing data.
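
Here is a minimal import block for the three libraries, using their standard aliases; this assumes all three are installed (pandas was installed during setup, so you may only need to add numpy and matplotlib):

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd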

Now, importing the data is very simple. We will use pandas to import the data.csv file which we created while gathering the data.

datasets = pd.read_csv('data.csv')
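
To confirm the import worked, a quick (optional) sanity check prints the first rows and the overall shape; with the table we scraped, the shape should be 10 rows by 4 columns:

print(datasets.head())    # first five rows of Country, Age, Salary, Purchased
print(datasets.shape)     # (10, 4)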

Dependent & Independent Variables

Now, we need to distinguish the matrix of features and the dependent variable vector. So, we are going to create a matrix of features. We will make a matrix of 3 independent variables.

X = datasets.iloc[: , :-1].values

Here, the ':' on the left of the comma means that we take all the rows into consideration, and ':-1' on the right of the comma means all the columns except the last one, Purchased. Now, we will build the dependent variable vector.

y = datasets.iloc[: , 3].values

The 3 on the right of the comma is the index of the last column of the table, Purchased (column indices start at 0).
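
If you print slices of the two arrays at this point, you can already spot the gaps we deal with next; a quick look at the first three rows might give something like this:

print(X[:3])   # [['France' 44.0 72000.0] ['Spain' 27.0 48000.0] ['Germany' 30.0 54000.0]]
print(y[:3])   # ['No' 'Yes' 'No']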

Missing values

As you can see, there are two missing values in the table: one in the Age column and the other in the Salary column. To solve this problem, we could just remove the complete row (observation), but that could be very dangerous if the row contains crucial information.

So, it is not recommended to remove observations. A better idea is to replace each missing value with the mean of its column. We will use scikit-learn for this task, specifically its SimpleImputer class, to do the heavy lifting.

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

In the first line, we create the 'imputer' object and tell SimpleImputer to fill the empty (np.nan) values: missing_values identifies the empty places in the column, and strategy='mean' says we are going to take the mean of the whole column.

In the second line, we fit that object to the columns that are missing data (Age and Salary, at indices 1 and 2). In the third line, we transform X, replacing the missing values with their column averages. This is how X looks now.

[[‘France’ 44.0 72000.0]
 [‘Spain’ 27.0 48000.0]
 [‘Germany’ 30.0 54000.0]
 [‘Spain’ 38.0 61000.0]
 [‘Germany’ 40.0 63777.77777777778]
 [‘France’ 35.0 58000.0]
 [‘Spain’ 38.77777777777778 52000.0]
 [‘France’ 48.0 79000.0]
 [‘Germany’ 50.0 83000.0]
 [‘France’ 37.0 67000.0]]

Categorical variables

As you can see, we have two categorical variables: Country and Purchased. Country has three categories and Purchased has two. Machine learning models are based on mathematical equations, so keeping text in the equations will create problems.

Therefore, we have to encode those variables. We are going to use LabelEncoder to encode our text variables.

from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])

In the first line, as usual, we create an object of the class LabelEncoder. In the second line, we use the fit_transform method to fit the label encoder and return the encoded column. Now, the independent variable matrix X will look like this.

[[0 44.0 72000.0]
 [2 27.0 48000.0]
 [1 30.0 54000.0]
 [2 38.0 61000.0]
 [1 40.0 63777.77777777778]
 [0 35.0 58000.0]
 [2 38.77777777777778 52000.0]
 [0 48.0 79000.0]
 [1 50.0 83000.0]
 [0 37.0 67000.0]]

But something is wrong in this matrix. Can you guess what? Well, France is denoted by 0, Germany by 1, and Spain by 2. This is a situation where our machine learning model will think that Spain is greater than Germany and Germany is greater than France, but in reality you can’t rank these three countries. To solve this problem, we are going to use dummy variables.

Dummy Variable

We’ll split the Country column into three columns since it has 3 categories. We are going to use ColumnTransformer and OneHotEncoder to split the column: OneHotEncoder creates a separate column for each category, and by specifying remainder='passthrough', all remaining columns that were not listed in the transformers are automatically passed through.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer([("Country", OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)

Now, the variable X will look like this.

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]

The first column represents France, the second Germany, and the third Spain. For each row, the column matching that row’s country shows 1 and the other two show 0.
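
If you want to verify the column order rather than trust it, recent scikit-learn versions (1.0 and later) can report the generated column names; note that because we label-encoded Country first, the categories at this point are the integers 0, 1, and 2 rather than the country names:

print(ct.get_feature_names_out())
# something like: ['Country__x0_0' 'Country__x0_1' 'Country__x0_2'
#                  'remainder__x1' 'remainder__x2']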

Dependent Variable

Now, we will encode variable Y.

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

Then the variable Y will look like this.

[0 1 0 0 1 1 0 1 0 1]

Be relieved: we won’t have to use OneHotEncoder here. LabelEncoder alone is enough because this is the dependent variable; the machine learning model will know it’s a category, and there is no order to worry about between the two values.

Split the Dataset Into Training and Test Set

We have a dataset of 10 observations. In any machine learning model, we have to separate the data into two sets: a training set and a test set.

The question is, why do we need to do this? Well, take a step back and focus on the word machine learning itself. This is about a machine that is going to learn something.

In our case, there is an algorithm that is going to learn something from your data to make predictions or complete machine learning goals.

We don’t want our algorithm to learn the training data by heart; otherwise, our ML model will just repeat what it memorized and perform poorly on data it hasn’t seen. We build the machine learning model on the training set and then evaluate its performance on the test set.

You should also keep one thing in mind: performance on the test set should not be too different from performance on the training set. We are going to use train_test_split to split our dataset. As you can see, the function’s name is quite intuitive.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

test_size is a float between 0 and 1 and represents the proportion of the dataset to include in the test split, while random_state controls the shuffling applied to the data before the split. X_train is the training set of variable X, X_test is the test set of variable X, y_train is the training set of variable y, and y_test is the test set of variable y.
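
With 10 observations and test_size=0.2, the split leaves 8 rows for training and 2 for testing; a quick, optional check of the shapes would look something like this:

print(X_train.shape, X_test.shape)   # (8, 5) and (2, 5): 3 dummy columns + Age + Salary
print(y_train.shape, y_test.shape)   # (8,) and (2,)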

Feature Scaling

If you take a look at our Age column, the values are not in the same range as Salary: Age runs from 27 to 50, while Salary runs from roughly 48,000 to 83,000. Values on such different scales will cause issues in our ML model.

The issue is that many ML models are based on Euclidean distance (think back to high school geometry). Since the Salary column has a much wider range of values, the Euclidean distance will be dominated by Salary, and the Age column will barely contribute.

We are going to use StandardScaler to rescale the values so that we have a solid ML model. There are mainly two ways to do this, standardisation and normalisation; StandardScaler performs standardisation on every value.
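
For reference, here is what the two rescalings compute, as a minimal numpy sketch using the Age values from our dataset (imputed value rounded); StandardScaler implements the first formula, and scikit-learn’s MinMaxScaler implements the second:

import numpy as np

ages = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, 38.78, 48.0, 50.0, 37.0])

# Standardisation: x' = (x - mean) / standard_deviation
standardised = (ages - ages.mean()) / ages.std()

# Normalisation (min-max scaling): x' = (x - min) / (max - min)
normalised = (ages - ages.min()) / (ages.max() - ages.min())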

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
# Fit on the training set only, then reuse the same mean and standard
# deviation to transform the test set, so no test-set information
# leaks into training.
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

Top Resources for Machine Learning Web Scraping Projects

Here are some of the best resources where you can find machine learning web scraping projects.

  1. Kaggle
  2. GitHub
  3. Google Code
  4. Bitbucket
  5. SourceForge

Conclusion

In this article, we learned how to scrape data using Python and BeautifulSoup and then perform data preprocessing using several important machine learning libraries. From here, you can move on to Simple Linear Regression.

The first part of the machine learning model is done. Feel free to comment and ask me anything. You can follow me on Twitter and Medium. Thanks for reading, and please hit the like button!

Manthan Koolwal

My name is Manthan Koolwal and I am the founder of scrapingdog.com. I love creating scrapers and seamless data pipelines.