
How to Quickly Parse HTML with Regex

19-10-2023

“Can HTML be parsed by regex?”

Well, it might sound challenging at first, but with the right guidance, parsing HTML with regex can become easy.

Whether you’re a developer aiming to extract specific content from web pages or a data enthusiast looking for efficient methods to sift through massive amounts of web data, understanding the basics of parsing HTML with regex is essential.

This blog goes deep into this technique, offering insights, examples, and best practices for those keen on mastering the art of HTML parsing using regular expressions.

Parse HTML with Regex

What will you learn from this article?

  • How regular expressions can be used in Python.
  • How to create patterns.

I am assuming that you have already installed Python 3.x on your computer. If not, please install it from the official Python website first.

Come, let us explore the art of HTML parsing using Python and Regex!

What is a Regular Expression?

A regular expression, or regex, is a sequence of characters that forms a search pattern used for matching strings. It is a powerful tool for text processing, data extraction, and similar tasks. It is supported by almost every language, including Python, JavaScript, and Java, and it has strong community support, so help with searching and matching is easy to find.

There are five types of Regular Expressions:

Types of regular expressions

Here is how regex can be used for data extraction

  • A sequence of characters is declared to match a pattern in the string.
  • Within that sequence, metacharacters such as the dot (.) and the asterisk (*) are often used. The dot matches any single character except a newline, and the asterisk matches zero or more occurrences of the preceding character or pattern.
  • Quantifiers are also used when building the pattern. For example, the plus (+) quantifier indicates one or more occurrences of the preceding character or pattern, while the question mark (?) quantifier indicates zero or one occurrence.
  • Character classes are used to match any one character from a defined set. For example, square brackets ([]) define a character class, such as [a-z], which matches any single lowercase letter.
  • Once this pattern is ready you can apply it to the HTML code you have downloaded from a website while scraping it.
  • After applying the pattern you will get a list of matching strings in Python.
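
The building blocks listed above can be tried out on a few short strings. This is only a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# Dot: matches any single character (except a newline).
dot_matches = re.findall(r"c.t", "cat cot cut ct")

# Asterisk: zero or more of the preceding character.
star_matches = re.findall(r"ab*", "a ab abbb")

# Plus: one or more of the preceding character.
plus_matches = re.findall(r"ab+", "a ab abbb")

# Question mark: zero or one of the preceding character.
optional_matches = re.findall(r"colou?r", "color colour")

# Character class: runs of one or more lowercase letters.
class_matches = re.findall(r"[a-z]+", "Hello World 42")

print(dot_matches)       # ['cat', 'cot', 'cut']
print(star_matches)      # ['a', 'ab', 'abbb']
print(plus_matches)      # ['ab', 'abbb']
print(optional_matches)  # ['color', 'colour']
print(class_matches)     # ['ello', 'orld']
```

Note how the character class [a-z] skips the uppercase H and W, as well as the digits.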

Example

Let’s say we have this text.

text = "I have a cat and a catcher. The cat is cute."

Our task is to search for all occurrences of the word “cat” in the above-given text string.

We are going to execute this task using the re library of Python.

In this case, the pattern will be r'\bcat\b'. Let me explain the step-by-step breakdown of this pattern.

  • \b: This is a word boundary metacharacter, which matches the position between a word character (e.g., a letter or a digit) and a non-word character (e.g., a space or a punctuation mark). It ensures that we match the whole word “cat” and not part of a larger word that contains “cat”.
  • cat: This is the literal string “cat” that we want to match in the text.
  • \b: Another word boundary metacharacter, which ensures that we match the complete word “cat”. If you want to learn more about word boundaries then read this article.

Python Code

import re

text = "I have a cat and a catcher. The cat is cute."
pattern = r'\bcat\b'

matches = re.findall(pattern, text)

print(matches)

In this example, we used the re.findall() function from the re module in Python to find all matches of the regular expression pattern \bcat\b in the text string. The function returns a list containing every match, which here is the word “cat” twice. The word “catcher” is skipped because of the word boundaries.

The output will look like this.

['cat', 'cat']
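
To see what the word boundaries contribute, the same text can be searched with and without \b. This is a small sketch using the text from the example above; the variable names are only illustrative.

```python
import re

text = "I have a cat and a catcher. The cat is cute."

# With word boundaries, only the standalone word "cat" matches.
with_boundaries = re.findall(r"\bcat\b", text)

# Without them, the "cat" inside "catcher" matches as well.
without_boundaries = re.findall(r"cat", text)

print(len(with_boundaries))     # 2
print(len(without_boundaries))  # 3
```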

This is just a simple example for beginners. Of course, the patterns become more involved when the input is real HTML. Now, let’s test our skill in parsing HTML using regex with a more complex example.

Parsing HTML with Regex

We are going to scrape a website in this section. We are going to download HTML code from the target website and then parse data out of it completely using Regex.

For this example, I am going to use books.toscrape.com. We will use two Python libraries to execute this task.

  • requests: A third-party library that we will use to make an HTTP connection with the target page. It will help us download the HTML code from the page.
  • re: A module from the Python standard library that lets us apply regular expression patterns to a string.

What are we going to scrape?

It is always better to decide in advance what exactly we want to scrape from the website.

Scraping bookstoscrape

We are going to scrape two things from this page.

  1. Title of the book
  2. Price of the book

Let’s Download the data

I will make a GET request to the target website in order to download all the HTML data from the website. For that, I will be using the requests library.

import requests
import re

l=[]
o={}

# Send a GET request to the website
target_url = 'http://books.toscrape.com/'
response = requests.get(target_url)

# Extract the HTML content from the response
html_content = response.text

Here is what we have done in the above code.

  • We first imported both libraries, requests and re.
  • Then an empty list l and an empty dictionary o were declared.
  • Then the target URL was declared.
  • An HTTP GET request was made using the requests library.
  • All the HTML data is stored inside the html_content variable.
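
One detail worth checking after the download is the text encoding. As far as I know, requests falls back to ISO-8859-1 for text responses when the server does not declare a charset, which can turn the pound sign of a UTF-8 page into “Â£”. The sketch below reproduces that effect without any network access. Whether the target site actually omits the charset is an assumption here, so it is worth inspecting response.encoding yourself.

```python
# A UTF-8 page sends the pound sign as two bytes.
raw_bytes = "£51.77".encode("utf-8")

# Decoding those bytes with the wrong codec produces a stray character.
wrong = raw_bytes.decode("iso-8859-1")

# Decoding with the right codec restores the original text.
right = raw_bytes.decode("utf-8")

print(wrong)  # Â£51.77
print(right)  # £51.77

# With requests, the equivalent fix (assuming the page really is UTF-8)
# is to set the encoding before reading the text:
#   response.encoding = "utf-8"
#   html_content = response.text
```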

Let’s parse the data with Regex

Now, we have to design a pattern through which we can extract the title and the price of the book from the HTML content. First, let’s focus on the title of the book.

Inspecting the title in the source code

The title is stored inside the h3 tag. Inside it there is an a tag, which holds the title. So, the title pattern should look like this.

title_pattern = r'<h3><a.*?>(.*?)<\/a><\/h3>'

I know you might be wondering how I created this pattern, right? Let me explain it by breaking it down.

  • <h3>: This is a literal string that matches the opening <h3> tag in the HTML content.
  • <a.*?>: This part of the pattern matches the <a> tag with any additional attributes that might be present in between the opening <a> tag and the closing >. The .*? is a non-greedy quantifier that matches zero or more characters in a non-greedy (minimal) way, meaning it will match as few characters as possible.
  • (.*?): This part of the pattern uses parentheses to capture the text within the <a> tags. The .*? inside the parentheses is a non-greedy quantifier that captures any characters (except for newline) in a non-greedy (minimal) way.
  • <\/a>: This is a literal string that matches the closing </a> tag in the HTML content.
  • <\/h3>: This is a literal string that matches the closing </h3> tag in the HTML content.

So, the title_pattern is designed to match the entire HTML element for the book title, including the opening and closing <h3> tags, the <a> tag with any attributes, and the text within the <a> tags, which represent the book title. The captured text within the parentheses (.*?) is then used to extract the actual title of the book using the re.findall() function in Python.
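
The effect of the non-greedy .*? is easiest to see when two titles sit on the same line. The snippet below is a sketch on a made-up HTML string (the file names and titles are hypothetical), comparing the pattern from above with a greedy variant.

```python
import re

# Hypothetical snippet: two book titles on a single line.
sample_html = (
    '<h3><a href="book-1.html">First Book</a></h3>'
    '<h3><a href="book-2.html">Second Book</a></h3>'
)

# Non-greedy: each .*? stops at the first place the rest of the pattern fits.
non_greedy = re.findall(r'<h3><a.*?>(.*?)<\/a><\/h3>', sample_html)

# Greedy: .* grabs as much as possible, so both elements collapse into one match.
greedy = re.findall(r'<h3><a.*>(.*)<\/a><\/h3>', sample_html)

print(non_greedy)  # ['First Book', 'Second Book']
print(greedy)      # ['Second Book']
```

The greedy version finds a single match that spans both elements and only keeps the last title, which is why the non-greedy form is used throughout this article.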

Now, let’s shift our focus to the price of the book.

Inspecting the Price of the page in the source code

The price is stored inside the p tag with class price_color. So, we have to create a pattern that starts with <p class="price_color"> and ends with </p>.

price_pattern = r'<p class="price_color">(.*?)<\/p>'

This one is pretty straightforward compared to the other one. But let me again break it down for you.

  • <p class="price_color">: This is a literal string that matches the opening <p> tag with the attribute class="price_color", which represents the HTML element that contains the book price.
  • (.*?): This part of the pattern uses parentheses to capture the text within the <p> tags. The .*? inside the parentheses is a non-greedy quantifier that captures any characters (except for newline) in a non-greedy (minimal) way.
  • <\/p>: This is a literal string that matches the closing </p> tag in the HTML content.

So, the price_pattern is designed to match the entire HTML element for the book price, including the opening <p> tag with the class="price_color" attribute, the text within the <p> tags, which represent the book price, and the closing </p> tag. The captured text within the parentheses (.*?) is then used to extract the actual price of the book using the re.findall() function in Python.
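
The captured prices are plain strings that still include the currency symbol. If you need numbers, a second small pattern can pull out the numeric part. This is a sketch on a made-up HTML string modelled on the markup described above; the sample prices and the numeric_prices name are assumptions for illustration.

```python
import re

# Hypothetical snippet modelled on the markup described above.
sample_html = (
    '<p class="price_color">£51.77</p>'
    '<p class="price_color">£53.74</p>'
)

price_pattern = r'<p class="price_color">(.*?)<\/p>'
raw_prices = re.findall(price_pattern, sample_html)

# Keep only the numeric part of each price and convert it to a float.
numeric_prices = []
for raw in raw_prices:
    number = re.search(r"\d+(?:\.\d+)?", raw)
    if number:
        numeric_prices.append(float(number.group()))

print(raw_prices)      # ['£51.77', '£53.74']
print(numeric_prices)  # [51.77, 53.74]
```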

import requests
import re

l=[]
o={}

# Send a GET request to the website
target_url = 'http://books.toscrape.com/'
response = requests.get(target_url)

# Extract the HTML content from the response
html_content = response.text

# Define regular expression patterns for title and price
title_pattern = r'<h3><a.*?>(.*?)<\/a><\/h3>'
price_pattern = r'<p class="price_color">(.*?)<\/p>'

# Find all matches of title and price patterns in the HTML content
titles = re.findall(title_pattern, html_content)
prices = re.findall(price_pattern, html_content)

Since the titles and prices variables are lists, we have to run a for loop to pair each title with its corresponding price and store the result inside the list l.

for i in range(len(titles)):
    o["Title"]=titles[i]
    o["Price"]=prices[i]
    l.append(o)
    o={}


print(l)

This way we will get all the prices and titles of all the books present on the page.
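
The index-based loop above raises an IndexError if the price pattern matches fewer times than the title pattern. A possible alternative, sketched here with made-up sample lists standing in for the re.findall() results, is to pair the two lists with zip() and compare their lengths first.

```python
# Sample data standing in for the results of re.findall().
titles = ["First Book", "Second Book", "Third Book"]
prices = ["£51.77", "£53.74"]  # one price missing on purpose

# zip() pairs items by position and stops at the shorter list.
books = [{"Title": title, "Price": price} for title, price in zip(titles, prices)]

# A length mismatch means at least one pattern missed an element.
lengths_match = len(titles) == len(prices)

print(books)
print(lengths_match)  # False
```

Because zip() silently drops the unmatched title, the length check is what tells you that the pairs may be misaligned.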

Complete Code

You can scrape many more things, like ratings and product URLs, using regex. But for the current scenario, the code will look like this.

import requests
import re

l=[]
o={}

# Send a GET request to the website
target_url = 'http://books.toscrape.com/'
response = requests.get(target_url)

# Extract the HTML content from the response
html_content = response.text

# Define regular expression patterns for title and price
title_pattern = r'<h3><a.*?>(.*?)<\/a><\/h3>'
price_pattern = r'<p class="price_color">(.*?)<\/p>'

# Find all matches of title and price patterns in the HTML content
titles = re.findall(title_pattern, html_content)
prices = re.findall(price_pattern, html_content)


for i in range(len(titles)):
    o["Title"]=titles[i]
    o["Price"]=prices[i]
    l.append(o)
    o={}


print(l)

Conclusion

In this guide, we’ve demystified the process of utilizing regex patterns to efficiently parse intricate HTML content, bypassing the need for dedicated libraries like Beautiful Soup or lxml. For newcomers, regular expressions may initially seem daunting, but with consistent practice, their power and flexibility become unmistakable.

Regular expressions stand as a potent tool, especially when dealing with multifaceted data structures. Our previous article on web scraping Amazon using Python showcased the use of regex in extracting product images, offering further insights into the versatility of this method. For a deeper dive and more real-world examples, I recommend giving it a read.

I hope you liked this little tutorial. If you did, please do not forget to share it with your friends and on your social media.


Manthan Koolwal

My name is Manthan Koolwal and I am the founder of scrapingdog.com. I love creating scrapers and seamless data pipelines.