BIG NEWS: Scrapingdog is collaborating with Serpdog.

BeautifulSoup Tutorial: Scraping Web Pages With Python

Table of Contents

Web Scraping is incomplete without data extraction from the raw HTML or XML you get from the target website.

When it comes to web scraping then Python is the most popular choice among programmers because it has great community support and along with that, it is very easy to code. It is readable as well without those semicolons and curly braces.

Python also comes with a tremendous amount of libraries that help in different aspects of any web application. One of them is Beautiful Soup, which is mainly used in web scraping projects. Let’s understand what it is and how it works.

What is BeautifulSoup?

Beautiful Soup is a Python library that was named after Lewis Carroll’s poem of the same name in “Alice’s Adventures in the Wonderland”. It is also known as BS4. Basically, BS4 is used to navigate and extract data from any HTML and XML documents.

Since it is not a standard Python library you have to install it in order to use it in your project. I am assuming that you have already installed Python on your machine.

pip install beautifulsoup4

It is now installed let’s test it with a small example. For this example, we will use the requests library of Python to make an HTTP GET request to the host website. In this case, we will use Scrapingdog as our target page. In place of this, you can select any web page you like.

Getting the HTML

It’s time to use BS4, let’s make a GET request to our target website to get the HTML.

Our aim would be to get the title of the website.

from bs4 import BeautifulSoup
import requests
target_url = "https://www.scrapingdog.com/"
resp = requests.get(target_url)
print(resp.text)

Output of the above script will look like this.

<!DOCTYPE html>
<html lang="en">
<head>....

We got the complete HTML data from our target website.

Parsing HTML with BeautifulSoup

Now, we have the raw HTML data. This is where BS4 comes into action. We will kick start this tutorial by first scraping the title of the page and then by extracting all the URLs present on the page.

Get the Title

from bs4 import BeautifulSoup
import requests
target_url = "https://www.scrapingdog.com/"
resp = requests.get(target_url)

soup = BeautifulSoup(resp.text, 'html.parser')
print(soup.title)

Output of this script will look like this.

<title>Web Scraping API | Reliable &amp; Fast</title>

Let’s try one more example to better understand how BS4 actually works. This time let’s scrape all the URLs available on our target page.

Getting all the Urls

Here we will use .get() method provided by BS4 for extracting value of any attribute.

from bs4 import BeautifulSoup
import requests
target_url = "https://www.scrapingdog.com/"
resp = requests.get(target_url)
soup = BeautifulSoup(resp.text, "html.parser")

allUrls = soup.find_all("a")

for i in range(0,len(allUrls)):
    print(allUrls[i].get('href'))

Output of the script will look like this.

/
blog
pricing
documentation
https://share.hsforms.com/1ex4xYy1pTt6rrqFlRAquwQ4h1b2
https://api.scrapingdog.com/login
https://api.scrapingdog.com/register
javascript://
/
javascript://
blog
pricing
documentation
https://share.hsforms.com/1ex4xYy1pTt6rrqFlRAquwQ4h1b2
https://api.scrapingdog.com/login
https://api.scrapingdog.com/register
https://api.scrapingdog.com/register
/pricing
asynchronous-scraping-webhook
https://api.scrapingdog.com/register
https://api.scrapingdog.com/register
/documentation#python-linkedinuser
/documentation#python-google-search-api
/documentation#python-proxies
/documentation#python-screenshot
documentation
https://api.scrapingdog.com/register
documentation
https://share.hsforms.com/1ex4xYy1pTt6rrqFlRAquwQ4h1b2
tool
about
blog
faq
affiliates
documentation#proxies
documentation#proxies
terms
privacy
<a class="twitter-timeline" data-width="640" data-height="960" data-dnt="true" href="https://twitter.com/scrapingdog?ref_src=twsrc%5Etfw" target="_blank" rel="noopener">Tweets by scrapingdog</a><script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
https://www.linkedin.com/company/scrapingdog

This way you can use BS4 for extracting different data elements from any web page. I hope now you have got an idea of how this library can be used for data extraction.

BeautifulSoup — Objects

Once you pass any HTML or XML document to the Beautiful Soup constructor it converts it into multiple python objects. Here is the list of objects.

  • Comments
  • BeautifulSoup
  • Tag
  • NavigableString

Comments

You might have got an idea by the name itself. This object contains all the comments available in the HTML document. Let me show you how it works.

from bs4 import BeautifulSoup

soup=BeautifulSoup('
','html.parser') print(soup.div)

The output of the above script will look like this.

<div><!-- This is the comment section --></div>

BeautifulSoup

It is the object which we get when we scrape any web page. It is basically the complete document.

from bs4 import BeautifulSoup

soup=BeautifulSoup('
This is BS4
','html.parser') print(type(soup))
<class 'bs4.BeautifulSoup'>

Tag

A tag object is the same as the HTML or XML tags.

from bs4 import BeautifulSoup

soup=BeautifulSoup('
This is BS4
','html.parser') print(type(soup.div))
<class 'bs4.element.Tag'>

There are two features of tag as well.

  1. Name — It can be accessed with a .name suffix. It will return the type of tag.
  2. Attributes — Any tag object can have any number of attributes. “class”, “href”, “id”, etc are some famous attributes. You can even create your own custom attribute to hold any value. This can be accessed through .attrs suffix.

NavigableString

It is the content of any tag. You can access it by using .text suffix.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div>This is BS4</div>', 'html.parser')

print(type(soup.text))

<class 'str'>

How to search in a parse Tree?

There are multiple Beautifulsoup methods through which you can search any element inside a parse tree. But the most commonly used methods are find() and find_all(). Let’s understand them one by one.

What is the find() method?

Let’s say you already know that there is only one element with class ‘x’ then you can use the find() method to find that particular tag. Another example could be let’s say there are multiple classes ‘x’ and you just want the first one.

The find() method will not work if you want to get the second or third element with the same class name. I hope you got the idea. Let’s understand this by an example.

<div class="test1">test1</div><div class="q">test2</div><div class="z">test3</div>

Take a look at the above HTML code. We have three div tags with different class names. Let’s say I want to extract “test2” text. I know two things about this situation.

  1. test2 text is stored in class q
  2. class q is unique. That means there is only one class with the value q.

So, I can use the find() method here to extract the desired text.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="test1">test1</div><div class="q">test2</div><div class="z">test3</div>', 'html.parser')

data = soup.find("div", {"class": "q"}).text

print(data)
test2

Now let’s consider you have three div tags with the same classes.

<div class="q">test1</div><div class="q">test2</div><div class="q">test3</div>

In this condition, you cannot scrape “test2”. You can only scrape “test1” because it comes first in the parse tree. Here you will have to use the find_all() method.

What is the find_all() method?

Using the find_all() method you can extract all the elements with a particular tag. Unlike the find() method you can extract data from any tag even if it doesn’t appear on the top. It will return a list.

Consider this HTML string.

<div class="q">test1</div> <div class="q">test2</div> <div class="q">test3</div>

If I want to scrape “test2” text then I can easily do it by scraping all the classes at once since all of them have the same class names.

from bs4 import BeautifulSoup<br><br>

soup = BeautifulSoup('<div class="q">test1</div><div class="q">test2</div><div class="q">test3</div>', 'html.parser')<br><br>

data = soup.find_all("div", {"class": "q"})[1].text<br><br>

print(data)
test2

To get the second element from the list I have used [1]. This is how find_all() method works.

How to Modify a Tree?

This is the most interesting part. Beautiful Soup allows you to make changes to the parse tree according to your own requirements. Using attributes we can make changes to the tag’s property. Using .new_string(), .new_tag(), .insert_before(), etc methods we can add new tags and strings to an existing tag.

Let’s understand this with a small example.

<div class="q">test1</div>

I want to change the name of the tag and the class.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="q">test1</div>', 'html.parser')

data = soup.div

data.name = 'BeautifulSoup'
data['class'] = 'q2'

print(data)
<BeautifulSoup class="q2">test1</BeautifulSoup>

We can even delete the attribute like this.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="q">test1</div>', 'html.parser')

data = soup.div

data.name = 'BeautifulSoup'
del data['class']

print(data)
<BeautifulSoup>test1</BeautifulSoup>

This is just an example of how you can make changes to the parse tree using various methods provided by BS4.

Points to remember while using Beautiful Soup

  • As you know written HTML or XML documents use a specific encoding like ASCII or UTF-8 but when you load that document into BeautifulSoup it is converted to Unicode. BS4 by default uses a library unicode for this purpose.
  • While using BeautifulSoup you might face two kinds of errors. One is AttributeError and the other is the KeyError. AttributeError occurs when a sibling tag to the current element is absent and KeyError occurs when the HTML tag is missing.
  • You can use diagnose() function to analyze what BS4 does to our document. It will show how different parsers handle the document.

Conclusion

Beautiful Soup is a smart library. It makes our parsing job quite simple. Obviously, you can use it in more innovative ways than shown in this tutorial. While using it in a live project do remember to use try/except statements in order to avoid any production crash. You can lxml as well in place of BS4 for data extraction but personally, I like Beautiful Soup more due to vast community support.

I hope now you have a good idea of the difference between the two. Please do share this blog on your social media platforms. Let me know if you have any scraping-related queries. I would be happy to help you out.

Additional Resources

Web Scraping with Scrapingdog

Scrape the web without the hassle of getting blocked
My name is Manthan Koolwal and I am the founder of scrapingdog.com. I love creating scraper and seamless data pipelines.
Manthan Koolwal

Web Scraping with Scrapingdog

Scrape the web without the hassle of getting blocked

Recent Blogs

Scrape Google Maps using Python

Scrape Google Maps Data using Python

Best Python Web Scraping Libraries

4 Best Python Libraries for Efficient Web Scraping (Updated)