In this article, we will look at some of the best Python web scraping libraries out there. Web scraping is the process or technique used for extracting data from websites across the internet.
Other common names for web scraping are web crawling and web extraction. It's a simple process that starts with a website URL as the initial target. Web scraping with Python is widely used in many different fields.

Python is a general-purpose language with many uses, ranging from web development to AI and machine learning. You can perform Python web scraping by taking advantage of the many libraries and tools available on the internet.
We will discuss these tools: Beautiful Soup, Requests, Selenium, and Scrapy. A web scraper written in Python 3 can be used to collect data from websites.
The following are the prerequisites you will need to follow along with this tutorial:
- An installation of the latest version of Python.
- pip, the Python package manager.
- A code editor of your choice.
Once you've met the prerequisites above, create a project directory and navigate into it. Open your terminal and run the commands below.
mkdir python_scraper
cd python_scraper
4 Python Web Scraping Libraries & Basic Scraping with Each
There are a number of great web scraping tools available that can make your life much easier. Here's our list of the top Python web scraping libraries:
- BeautifulSoup: This is a Python library used to parse HTML and XML documents.
- Requests: Best for making HTTP requests.
- Selenium: Used to automate web browser interactions.
- Scrapy: This is a Python framework used to build web crawlers.
Let's get started.
1. Beautiful Soup
Beautiful Soup is one of the best Python libraries for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping. It can also extract data from the server-rendered HTML of some JavaScript-heavy web pages, although it does not execute JavaScript itself.
Open your terminal and run the command below:
pip install beautifulsoup4
With Beautiful Soup installed, create a new Python file and name it beautiful_soup.py.
We are going to scrape the [Books to Scrape](https://books.toscrape.com/) website for demonstration purposes. The Books to Scrape website looks like this:

We want to extract the titles of each book and display them on the terminal. The first step in scraping a website is understanding its HTML layout. In this case, you can view the HTML layout of this page by right-clicking on the page, above the first book in the list. Then click Inspect.
Below is a screenshot showing the inspected HTML elements.

You can see that the list is inside the <ol class="row"> element. The next direct child is the <li> element.
What we want is the book title, which is inside the <a>, inside the <h3>, inside the <article>, and finally inside the <li> element.
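Abridged, the markup for one book looks roughly like this (attribute values shortened; the live page may differ slightly in the attributes it carries):

<ol class="row">
  <li>
    <article class="product_pod">
      <h3><a href="..." title="A Light in the Attic">A Light in the ...</a></h3>
      <div class="product_price">
        <p class="price_color">£51.77</p>
      </div>
    </article>
  </li>
  ...
</ol>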
To scrape and get the book titles, we will work in the beautiful_soup.py file we created earlier.
When done, add the following code to the beautiful_soup.py file:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url_to_scrape = 'https://books.toscrape.com/'
request_page = urlopen(url_to_scrape)
page_html = request_page.read()
request_page.close()

html_soup = BeautifulSoup(page_html, 'html.parser')

# get book title
for data in html_soup.select('ol'):
    for title in data.find_all('a'):
        print(title.get_text())
In the above code snippet, we open our webpage with the help of the urlopen() method. The read() method reads the whole page and assigns the contents to the page_html variable. We then parse the page using html.parser to help us understand HTML code in a nested fashion.
Next, we use the select() method provided by the BS4 library to get the <ol class="row"> element. We loop through the HTML elements inside it to get every <a> tag, since the <a> tags contain the book names. Finally, we print the text inside each <a> tag with the help of the get_text() method.
You can execute the file using the terminal by running the command below.
python beautiful_soup.py
This should display something like this:

Now let's get the prices of the books too.

The price of each book is inside a <p> tag, inside a <div> tag. As you can see, there is more than one <p> tag and more than one <div> tag. To get the right element with the book price, we will use a CSS class selector; luckily for us, each class is unique for each tag.
Below is the code snippet to get the prices of each book; add it at the bottom of the file:
# get book prices
for price in html_soup.find_all('p', class_='price_color'):
    print(price.get_text())
If you run the code on the terminal, you will see something like this:

Your completed code should look like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url_to_scrape = 'https://books.toscrape.com/'
request_page = urlopen(url_to_scrape)
page_html = request_page.read()
request_page.close()

html_soup = BeautifulSoup(page_html, 'html.parser')

# get book title
for data in html_soup.select('ol'):
    for a in data.find_all('a'):
        print(a.get_text())

# get book prices
for price in html_soup.find_all('p', class_='price_color'):
    print(price.get_text())
If you've reached this point, you've just seen how easy this is. Let's move on to the next library.
2. Requests
Requests is an elegant HTTP library. It allows you to send HTTP requests without manually adding query strings to your URLs.
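For example, here is a minimal sketch of that feature using plain Requests; the httpbin.org URL and the query parameters are only for illustration:

import requests

# Requests encodes the params dict into the query string for us
response = requests.get('https://httpbin.org/get', params={'q': 'books', 'page': 2})
print(response.url)  # https://httpbin.org/get?q=books&page=2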
In this section we will use the requests-html library, which is built on top of Requests and adds HTML parsing helpers. We first need to install it; open your terminal and run the command below:
pip3 install requests_html
Once you have installed it, create a new Python file for the code. Avoid naming the file after a module such as requests, since that would shadow the library. Let's name the file requests_scrape.py.
Now add the code below inside the created file:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://books.toscrape.com/')

get_books = r.html.find('.row')[2]

# get book title
for title in get_books.find('h3'):
    print(title.text)

# get book prices
for price in get_books.find('.price_color'):
    print(price.text)
In this code snippet, the first line imports HTMLSession from the requests_html library, and we instantiate it. We then use the session to perform a GET request to the BooksToScrape URL.
After performing the GET request, we get the Unicode representation of the HTML content of the BooksToScrape website. From the HTML content, we find every element with the class row; the element at index 2 contains the list of books, and it is assigned to the get_books variable.
We want the book title. As in the first example, the book title is inside the <a>, inside the <h3>. We loop through the HTML content to find each <h3> element and print each title as text.
To get the prices of each book, we only change which element the find() method searches for in the HTML content. Luckily, the price is inside a <p> tag with a unique class, price_color, that isn't used anywhere else. We loop through the matches and print out the text content of each <p> tag.
Execute the code by running the following command in your terminal:
python requests_scrape.py
Below is the output of the book titles:

Below is the output of the book prices:

You can visit Requests HTML scraping with Python to learn more about the many things you can do with it.
3. Selenium
Selenium is a web-based automation tool. Its primary purpose is testing web applications, but it also does well at web scraping.
First, we are going to install Selenium. There are several ways to install it:
- You can install it using pip with the command:
pip install selenium
- You can also install it using Conda with the command:
conda install -c conda-forge selenium
- Alternatively, you can download the PyPI source archive (selenium-x.x.x.tar.gz) and install it using setup.py with the command below:
python setup.py install
We will be using the Chrome browser, and for this, we need the Chrome web driver to work with Selenium.
Download the Chrome web driver using either of the following methods:
1. You can download it directly from the Chrome driver download page. You will find several download options on the page depending on your version of Chrome. To locate which version of Chrome you have, click the three vertical dots at the top right corner of your browser window, click "Help" in the menu, and select "About Google Chrome."
The screenshot below illustrates how to go about it:

After clicking, you will see your version. I have version 92.0.4515.107, shown in the screenshots below:

2. Or by running the commands below, if you are on a Linux machine:
wget https://chromedriver.storage.googleapis.com/2.41/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
After installing, you need to know where you saved the web driver on your local computer; this gives us the path to the web driver. Mine is in my home directory.
To get the path to the web driver, open your terminal and drag the downloaded Chrome driver right into the terminal. The web driver's path will be displayed as output.
When you're done, create a new Python file; let's call it selenium_scrape.py.
Add the following code to the file:
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://books.toscrape.com/'
driver = webdriver.Chrome('/home/marvin/chromedriver')
driver.get(url)

container = driver.find_element_by_xpath('//*[@id="default"]/div/div/div/div/section/div[2]/ol')

# get book titles
titles = container.find_elements(By.TAG_NAME, 'a')
for title in titles:
    print(title.text)
In the above code, we first import the web driver from Selenium to control Chrome; Selenium requires a driver to interface with the chosen browser.
We then specify the driver we want to use, which is Chrome. It takes the path to the Chrome driver and goes to the site URL. Because we have not launched the browser in headless mode, the browser appears, and we can see what it is doing.
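If you would rather not have a browser window pop up, below is a minimal sketch of launching Chrome in headless mode instead; it assumes the same driver path used in this tutorial and a Selenium version that accepts the options argument:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window

# same ChromeDriver path as in the example above
driver = webdriver.Chrome('/home/marvin/chromedriver', options=options)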
The container variable holds the element found through the XPath of the <ol> list that contains the books. Selenium provides methods for locating elements by XPath, tag name, class name, and more; you can read more in the Selenium locating elements documentation.
To get the XPath of an element, inspect the page elements, find the element you need, and right-click on it. A dropdown menu will appear; select Copy, then select Copy XPath.
Just as shown below:

From the container variable, we can then find the titles by the tag name <a> and loop through them to print all the titles as text.
The output will be as shown below:

Now, let's change the file to get book prices by adding the following code after the get-book-titles code.
prices = container.find_elements(By.CLASS_NAME, 'price_color')
for price in prices:
    print(price.text)
In this code snippet, we get the price of each book using the class name of the book price element and loop through the results to print all prices as text. The output will be like the screenshot below:

Next, we want to access more data by clicking the next button and collecting the other books from other pages.
Change the file to resemble the one below:
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://books.toscrape.com/'
driver = webdriver.Chrome('/home/marvin/chromedriver')
driver.get(url)

def get_books_info():
    container = driver.find_element_by_xpath('//*[@id="default"]/div/div/div/div/section/div[2]/ol')
    # get book titles
    titles = container.find_elements(By.TAG_NAME, 'a')
    for title in titles:
        print(title.text)
    # get book prices
    prices = container.find_elements(By.CLASS_NAME, 'price_color')
    for price in prices:
        print(price.text)
    # follow the link to the next page
    next_page = driver.find_element_by_link_text('next')
    next_page.click()

for x in range(5):
    get_books_info()

driver.quit()
We have created the get_books_info() function. It will run several times to scrape data from successive pages, in this case 5 times.
We then use the find_element_by_link_text() method to get the <a> element containing the link to the next page.
Next, we call click() to take us to the next page. We scrape the data and print it out on the console, repeating this 5 times because of the range() function. After 5 successful scrapes, the driver.quit() method closes the browser.
You can choose a way of storing the data, either as a JSON file or in a CSV file; this is a task for you to do in your spare time.
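As a starting point, here is a minimal sketch of saving the results to a CSV file with Python's built-in csv module; it assumes you collect the scraped text into the titles and prices lists instead of printing it:

import csv

titles = []   # append title.text here inside the titles loop
prices = []   # append price.text here inside the prices loop

# write one (title, price) row per book
with open('books.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'price'])
    writer.writerows(zip(titles, prices))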
You can dive deeper into Selenium and get creative with it. I have a separate detailed guide on web scraping with Selenium & Python; do check it out too!
4. Scrapy
Scrapy is a powerful multipurpose tool for both scraping and crawling the web. Web crawling involves collecting the URLs of websites plus all the links associated with those websites, and finally storing them in a structured format on servers.
Scrapy provides many features, including but not limited to:
- Selecting and extracting data using CSS selectors
- Support for HTTP caching, crawl depth restriction, and user-agent spoofing
- Storage of structured data in various formats such as JSON, Marshal, CSV, Pickle, and XML
Let's dive into Scrapy. First, make sure you have Scrapy installed; install it by running the commands below:
sudo apt-get update
sudo apt install python3-scrapy
We will have to set up a new Scrapy project. Enter a directory where you'd like to store your code and run the commands below:
scrapy startproject tutorial
cd tutorial
This will create a tutorial directory with the following contents:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
The screenshot below shows the project structure:

Before we add code to the project we created, note that the best way to learn how to extract data with Scrapy is by using the Scrapy shell.
Scraping using the Scrapy Shell
The shell comes in handy because it quickens the debugging of our scraping code without the need to run the spider. To run the Scrapy shell, use the command syntax below:
scrapy shell <url>
On your terminal, run:
scrapy shell 'https://books.toscrape.com/'
If you don't get any data back, you can add the user agent with the command below:
scrapy shell -s USER_AGENT='<your user agent>' 'https://books.toscrape.com/'
To get your USER_AGENT, open your dev tools with Ctrl+Shift+I. Navigate to the console, clear it, type navigator.userAgent, then hit enter.
An example of a USER AGENT can be: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36.
The screenshot below shows how to get the name of your USER_AGENT in the dev tools:

If youâre successful in getting output from the shell command, you will see a resemblance to the one below:

Using the shell, you can try selecting elements using CSS. The shell returns a response object.
Let us get the titles and prices of the books from our test website, BooksToScrape, using the response object. The book title is inside an <a> element, inside the <h3>, inside the <article>, inside the <li>, inside the <ol> with the class row, and finally inside a <div> element.
We create a variable called container and assign it the part of the response containing the <ol> element with the class row inside that <div> element.
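In the shell, that looks like the line below; the div ol.row selector mirrors the one the spider code later in this tutorial uses:

>>> container = response.css('div ol.row')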
To see the container output in the Scrapy shell, type container and hit enter; the output will be like below:

Now, let us find the book title of each book using the response object we got above.
Create a variable called titles; this will hold our book titles. Using the container selection from above, we select the <a> inside each <h3> element with the CSS selectors that Scrapy provides.
We then use the ::text CSS extension provided by Scrapy to get the text of the <a> element and, finally, the getall() method to get all the titles, as shown below:
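In the shell, the command looks like this (::text is the CSS extension Scrapy provides for extracting text nodes):

>>> titles = container.css('h3 a::text').getall()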

Run titles to get the output of all the book titles. You should see an output like the one below:

That went well.
Now, let us get prices for each book.
In the same Scrapy shell, create a prices variable to hold our prices. We use the same container selection and, with CSS, select the <p> element with the class price_color.
We use the ::text CSS extension provided by Scrapy to get the text from the <p> element and, finally, the getall() method to get all the prices, as shown below:
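In the shell:

>>> prices = container.css('p.price_color::text').getall()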

Run prices; your output should look like the one below:

That was quick, right? The Scrapy shell saves us a lot of debugging time because it provides an interactive shell.
Now let's try using a spider.
Scraping using a spider
Let's go back to the tutorial folder we created; we will add a spider. A spider is what Scrapy uses to scrape information from a website or a group of websites.
Create a new file named books_spider.py under the tutorial/spiders directory in your project.
Add the following code to the file:
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = [
        'https://books.toscrape.com/'
    ]

    def parse(self, response):
        for book in response.css('div ol.row'):
            title = book.css('h3 a::text').getall()
            price = book.css('p.price_color::text').getall()
            yield {
                'title': title,
                'price': price,
            }
BooksSpider subclasses scrapy.Spider. It has a name attribute, the name of our spider, and the start_urls attribute, which holds a list of URLs; the spider makes its initial requests from this list.
It can also define how to follow links in the pages and parse the downloaded page content to extract data.
The parse method parses the response, extracting the scraped data as dictionaries. It also finds new URLs to follow and creates new requests from them.
To get output from our code, let's run a spider. To run a spider, you can run the command with the syntax below:
scrapy crawl <spider name>
On your terminal, run the command below:
scrapy crawl books
You will get an output resembling the one below:

We can store the extracted data in a JSON file using Feed Exports, which Scrapy provides out of the box. Feed Exports support many serialization formats, including JSON, XML, and CSV, to name a few.
To generate a JSON file with the scraped data, run the command below:
scrapy crawl books -o books.json
This will generate a books.json file with contents resembling the one below:

Following links with Scrapy
Let's follow the link to the next page and extract more book titles. We inspect the elements and get the link to the page we want to follow.
The link is in an <a> tag, inside an <li> tag with the class next, inside a <ul> tag with the class pager, and finally inside a <div> tag.
Below is a screenshot of the inspected element with a link to our next page:

Let's use the Scrapy shell to get the link to the next page first. Run the scrapy shell command with the Books to Scrape URL.
We then get the href attribute to determine the specific URL the next page goes to, just like below:
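In the shell, the selector below mirrors the one used in the spider that follows; on the first page it should return a relative URL like the one shown:

>>> response.css('li.next a::attr(href)').get()
'catalogue/page-2.html'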

Let's now use our spider, and modify the books_spider.py file to repeatedly follow the link to the next page, extracting data from each page.
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = [
        'https://books.toscrape.com/'
    ]

    def parse(self, response):
        for book in response.css('div ol.row'):
            title = book.css('h3 a::text').getall()
            price = book.css('p.price_color::text').getall()
            yield {
                'title': title,
                'price': price,
            }

        # follow the link to the next page until there is none
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
In this code snippet, we create a variable next_page that holds the URL of the next page. We then check that the link is not empty. Next, we use the response.follow() method and pass the URL and a callback; this returns a Request instance. Finally, we yield this Request.
We can go back to the terminal and extract a list of all books and titles into an allbooks.json file.
Run the command below:
scrapy crawl books -o allbooks.json
After it's done scraping, open the newly created allbooks.json file. The output looks like the one below:

You can do many things with Scrapy, including pausing and resuming crawls and a wide range of other web scraping tasks. I have made a separate guide on web scraping with Scrapy; do check it out too!
Takeaway
This tutorial discussed the various Python open-source libraries for scraping website data. If you followed along to the end, you can now create anything from simple to complex scrapers that crawl over any number of web pages. You can dive deeper into these libraries and hone your skills. Data is a very important part of decision-making in the world we live in today, and mastering how to collect it will place you way ahead.
The code for this tutorial is available from this GitHub Repository.
Frequently Asked Questions:
Which library is best for web scraping?
Beautiful Soup is one of the best libraries for web scraping, especially for parsing HTML; the right choice ultimately depends on the task, with Scrapy better suited to large-scale crawling and Selenium to JavaScript-heavy pages.
Is Python best for web scraping?
Yes, Python is one of the best languages for web scraping, and many web scraping tools are built using it.
Is scrapy a Python library?
Yes, Scrapy is a Python framework for scraping at large scale. It gives you all the tools you need to harvest data from websites.
Additional Resources
Here are a few additional resources that you may find helpful during your web scraping journey:
- Best Javascript Libraries
- Best Datacenter Proxies
- How to build a web crawler using Python
- Web scraping with XPath and Python
- How to Use A Proxy with Python Requests
Feel free to message us if you have any doubts about Python web scraping libraries.