There is a vast amount of data available on the internet, and much of it is useful. You can analyze it, make better decisions, and even try to predict changes in the stock market. But there is a gap between this raw data and your decision-making graphs, and that gap is filled by HTML parsing.
If you want to use this data for your personal or business needs, you first have to scrape it and clean it. Raw HTML is not human readable, so you need a mechanism to turn it into clean, structured data. This technique is called HTML parsing.
In this blog, we will talk about the best Python HTML parsing libraries available. Many new coders get confused when choosing a parsing library. Python is backed by a very large community and therefore offers multiple options for parsing HTML.
Here are some common criteria and reasons for selecting specific HTML parsing libraries for this blog.
- Ease of Use and Readability
- Performance and Efficiency
- Error Handling and Robustness
- Community and Support
- Documentation and Learning Resources
Top 4 Python HTML Parsing Libraries
BeautifulSoup
BeautifulSoup is the most popular of all the HTML parsing libraries. It helps you parse HTML and XML documents with ease. Once you read the documentation, you will find it very easy to create parse trees and extract useful data from them.
Since it is a third-party package, you have to install it with pip in your project environment by running pip install beautifulsoup4. Let's understand how we can use it in Python with a small example.
The first step is to import it into your Python script. Of course, you normally have to scrape the data from the target website first, but in this blog we will focus on the parsing step. You can refer to web scraping with Python to learn more about the scraping side and the best Python web scraping libraries.
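For context, fetching the raw HTML usually takes only a couple of lines. Here is a minimal sketch using the requests library, with https://example.com standing in as a placeholder URL:
import requests

# Fetch a page; the response body is the raw HTML we hand to a parser
response = requests.get("https://example.com")
html = response.text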
Example
Let’s say we have the following simple HTML document as a string.
<html>
<head>
<title>Sample HTML Page</title>
</head>
<body>
<h1>Welcome to BeautifulSoup Example</h1>
<p>This is a paragraph of text.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
</body>
</html>
Here’s a Python code example using BeautifulSoup.
from bs4 import BeautifulSoup
# Sample HTML content
html = """
Sample HTML Page
Welcome to BeautifulSoup Example
This is a paragraph of text.
- Item 1
- Item 2
- Item 3
"""
# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')
# Accessing Elements
print("Title of the Page:", soup.title.text) # Access the title element
print("Heading:", soup.h1.text) # Access the heading element
print("Paragraph Text:", soup.p.text) # Access the paragraph element's text
# Accessing List Items
ul = soup.ul # Access the unordered list element
items = ul.find_all('li') # Find all list items within the ul
print("List Items:")
for item in items:
    print("- " + item.text)
Let me explain the code step by step:
- We import the BeautifulSoup class from the bs4 library and create an instance of it by passing our HTML content and the parser to use (in this case, 'html.parser').
- We access specific elements in the HTML using the BeautifulSoup object. For example, we access the title, heading (h1), and paragraph (p) elements and use the .text attribute to extract their text content.
- We access the unordered list (ul) element and then use .find_all('li') to find all list items (li) within it. We iterate through these list items and print their text.
Once you run this code you will get the following output.
Title of the Page: Sample HTML Page
Heading: Welcome to BeautifulSoup Example
Paragraph Text: This is a paragraph of text.
List Items:
- Item 1
- Item 2
- Item 3
You can adapt similar techniques for more complex web scraping and data extraction tasks. If you want to learn more about BeautifulSoup, you should read web scraping with BeautifulSoup.
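As a quick illustration of such adaptation, BeautifulSoup can also match elements by attribute or by CSS selector. A minimal sketch, reusing the soup object from the example above (the product class name is purely hypothetical):
# Match by attribute; "product" is a made-up class name for illustration
divs = soup.find_all("div", class_="product")

# Or use CSS selectors via select()
for li in soup.select("ul li"):
    print(li.text)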
lxml
lxml is considered to be one of the fastest parsing libraries available, and it receives regular updates (the most recent release at the time of writing came out in July 2023). Through its ElementTree API you get access to the libxml2 and libxslt C toolkits for parsing HTML and XML. It has great documentation and community support.
BeautifulSoup also provides support for lxml. You can use it by simply passing lxml as the second argument to the BeautifulSoup constructor, as shown below.
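A minimal sketch, reusing the html string from the BeautifulSoup example above (lxml must be installed for this to work):
from bs4 import BeautifulSoup

# Same API as before, but lxml does the parsing under the hood
soup = BeautifulSoup(html, 'lxml')
print(soup.title.text)  # Sample HTML Page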
lxml can parse both HTML and XML documents with high speed and efficiency. It follows standards closely and provides excellent support for XML namespaces, XPath, and CSS selectors.
In my experience, you should always prefer BS4 when dealing with messy HTML and use lxml when you are dealing with XML documents.
Like BeautifulSoup, this is a third-party package that needs to be installed before you start using it in your script. You can do that by running pip install lxml.
Let me show you how it can be used with a small example.
Example
<bookstore>
<book>
<title>Python Programming</title>
<author>Manthan Koolwal</author>
<price>36</price>
</book>
<book>
<title>Web Development with Python</title>
<author>John Smith</author>
<price>34</price>
</book>
</bookstore>
Our objective is to extract this text using lxml.
from lxml import etree
# Sample XML content
xml = """
Python Programming
Manthan Koolwal
36
Web Development with Python
John Smith
34
"""
# Create an ElementTree from the XML
tree = etree.XML(xml)
# Accessing Elements
for book in tree.findall("book"):
    title = book.find("title").text
    author = book.find("author").text
    price = book.find("price").text
    print("Title:", title)
    print("Author:", author)
    print("Price:", price)
    print("---")
Let me explain the above code step by step.
- We import the etree module from the lxml library and create an element tree by passing our XML content to etree.XML().
- We access specific elements in the XML using the find() and findall() methods. For example, we find all <book> elements within the <bookstore> using tree.findall("book").
- Inside the loop, we access the <title>, <author>, and <price> elements within each <book> element using book.find("element_name").text.
The output will look like this.
Title: Python Programming
Author: Manthan Koolwal
Price: 36
---
Title: Web Development with Python
Author: John Smith
Price: 34
---
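lxml's XPath support deserves a quick demonstration too. A minimal sketch, reusing the tree object from the example above:
# One XPath expression pulls out all book titles at once
titles = tree.xpath("//book/title/text()")
print(titles)  # ['Python Programming', 'Web Development with Python']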
If you want to learn more about this library, then you should definitely check out our guide Web Scraping with Xpath and Python.
html5lib
html5lib is another strong contender on this list, and it works great for parsing the latest HTML5. It can parse XML as well, but it is mainly used for parsing HTML5.
It can parse documents even when they contain missing or improperly closed tags, making it valuable for web scraping tasks where the quality of HTML varies. html5lib produces a DOM-like tree structure, allowing you to navigate and manipulate the parsed document easily, similar to how you would interact with the Document Object Model (DOM) in a web browser.
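To illustrate this error tolerance, here is a minimal sketch that feeds html5lib deliberately broken markup and shows that it still builds a complete document tree:
import html5lib

# Deliberately malformed HTML: no closing tags at all
broken = "<p>First paragraph<p>Second paragraph"

# html5lib repairs the markup the same way a browser would
doc = html5lib.parse(broken, treebuilder="dom")
print(len(doc.getElementsByTagName("p")))  # 2 - both paragraphs recovered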
Whether you're working with modern web pages and HTML5 documents or need a parsing library capable of handling the latest web standards, html5lib is a reliable choice.
Again, this needs to be installed before you start using it. You can do that by running pip install html5lib. After this step, you can import the library directly inside your Python script.
Example
import html5lib
# Sample HTML5 content
html5 = """
<!DOCTYPE html>
<html>
<head>
<title>HTML5lib Example</title>
</head>
<body>
<h1>Welcome to HTML5lib</h1>
<p>This is a paragraph of text.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
</body>
</html>
"""
# Parse the HTML5 document into a DOM tree
parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("dom"))
tree = parser.parse(html5)
# Accessing elements through the DOM API
title = tree.getElementsByTagName("title")[0].firstChild.nodeValue
heading = tree.getElementsByTagName("h1")[0].firstChild.nodeValue
paragraph = tree.getElementsByTagName("p")[0].firstChild.nodeValue
list_items = tree.getElementsByTagName("li")
print("Title:", title)
print("Heading:", heading)
print("Paragraph Text:", paragraph)
print("List Items:")
for item in list_items:
    print("- " + item.firstChild.nodeValue)
Explanation of the code:
- We import the html5lib library, which provides the HTML5 parsing capabilities we need.
- We define the HTML5 content as a string in the html5 variable.
- We create an HTML5 parser using html5lib.HTMLParser and specify the tree builder as "dom" to create a Document Object Model (DOM) tree structure.
- We parse the HTML5 document using the created parser, resulting in a DOM document.
- We access specific elements in the tree using the DOM method getElementsByTagName(). For example, we find the <title>, <h1>, <p>, and <li> elements and read their text content via firstChild.nodeValue.
Once you run this code you will get this.
Title: HTML5lib Example
Heading: Welcome to HTML5lib
Paragraph Text: This is a paragraph of text.
List Items:
- Item 1
- Item 2
- Item 3
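As a side note, html5lib's default tree builder is "etree", which returns standard ElementTree elements whose tag names are qualified with the XHTML namespace; that namespace trips up many first-time users. A minimal sketch:
import html5lib

tree = html5lib.parse("<h1>Hello</h1>")  # default tree builder is "etree"
ns = "{http://www.w3.org/1999/xhtml}"
print(tree.find(f".//{ns}h1").text)  # Hello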
You can refer to its documentation if you want to learn more about this library.
Pyquery
With PyQuery you can use jQuery-style syntax to query XML and HTML documents. So, if you are already familiar with jQuery, then PyQuery will be a piece of cake for you. Behind the scenes, it actually uses lxml for parsing and manipulation.
Its application is similar to BeautifulSoup and lxml. With PyQuery, you can easily navigate and manipulate documents, select specific elements, extract text or attribute values, and perform various operations on the parsed content.
This library receives regular updates and has growing community support. PyQuery supports CSS selectors, allowing you to select and manipulate elements in a document using familiar CSS selector expressions.
Example
from pyquery import PyQuery as pq
# Sample HTML content
html = """
PyQuery Example
Welcome to PyQuery
- Item 1
- Item 2
- Item 3
"""
# Create a PyQuery object
doc = pq(html)
# Accessing Elements
title = doc("title").text()
heading = doc("h1").text()
list_items = doc("ul li")
print("Title:", title)
print("Heading:", heading)
print("List Items:")
for item in list_items:
    print("- " + pq(item).text())
Understand the above code:
- We import the PyQuery class from the pyquery library.
- We define the HTML content as a string in the html variable.
- We create a PyQuery object doc by passing it the HTML content.
- We use PyQuery's CSS selector syntax to select specific elements in the document. For example, doc("title") selects the <title> element.
- We extract text content from the selected elements using the text() method.
Once you run this code you will get this.
Title: PyQuery Example
Heading: Welcome to PyQuery
List Items:
- Item 1
- Item 2
- Item 3
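PyQuery can read attribute values as well as text, via attr(). A minimal sketch, using a made-up link purely for illustration:
from pyquery import PyQuery as pq

# Hypothetical fragment with an attribute worth extracting
doc = pq('<div><a href="https://example.com" class="nav">Home</a></div>')
link = doc("a")
print(link.attr("href"))  # https://example.com
print(link.text())        # Home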
Conclusion
I hope things are pretty clear now. You have multiple options for parsing, but if you dig deeper you will realize that very few of them hold up in production. If you want to mass-scrape websites, then BeautifulSoup should be your go-to choice, and if you want to parse XML, then lxml should be your choice.
Of course, the list does not end here. There are other options like requests-html, Scrapy, etc., but the community support behind BeautifulSoup and lxml is on another level.
You should also try these libraries on a live website. Scrape some websites and use one of these libraries to parse out the data to form your own conclusions. If you want to crawl a complete website, then Scrapy is a great choice. We have also explained web crawling in Python; it's a great tutorial, and you should read it.
I hope you liked this tutorial, and if you did, please do not forget to share it with your friends and on social media.