
Scrape Email Addresses From Websites using Python

13-01-2023

Email scraping has become a popular and efficient way to collect contact information from the internet. By learning how to scrape emails, businesses and individuals can expand their networks, gather leads, and conduct market research more effectively. In this article, we will look at how to extract email addresses from websites using Python.

How to Scrape Email Addresses From Any Website using Python

In this tutorial, we will build an email scraper using Python and regular expressions. Our target will be the email-addresses page on randomlists.com (the full URL appears in the code). Selenium is used here because this website relies on JavaScript to render its data.

Setting up the prerequisites

I am assuming that you have already installed Python 3.x on your machine. If not, you can download it from python.org. First, create a folder where we will keep our scraper files.

mkdir email_scraper

After this, you have to install the necessary libraries and web drivers.

pip install selenium
pip install beautifulsoup4

Along with this, you have to install ChromeDriver, the web driver Selenium uses to control Chrome and render websites. Download the version that matches the Chrome browser installed on your machine. With that, everything required for this article is installed.

Finally, create the file where we will write our scraper. I am naming it emails.py.

Let’s Start Scraping Emails

Let’s first write a small script to check that everything works. The first time it runs, the browser might start a little slowly, but it will work normally after a while.

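from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import re

# Raw string so the backslashes in the Windows path are not treated as escapes
PATH = r'C:\Program Files (x86)\chromedriver.exe'

target_url = "https://www.randomlists.com/email-addresses"

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(PATH))

driver.get(target_url)

# Give the JavaScript-rendered page time to load
time.sleep(10)

driver.close()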

The code is pretty simple and to the point. Let me explain it step by step.

  • We have imported the libraries we need at the top. Selenium is the one we installed; time and re ship with Python.
  • We have declared PATH, the location where the ChromeDriver executable is installed.
  • We have declared the target URL.
  • A Chrome instance is created using webdriver.Chrome().
  • The .get() method opens the target URL in the browser.
  • The time.sleep() method waits for the website to load. In this example, we wait 10 seconds for the page to finish rendering.
  • Finally, the .close() method closes the browser.

On a successful run, a Chrome window opens, loads the target page, and closes again after about 10 seconds.

We will use regular expressions to identify the emails and scrape them. If you are new to regular expressions, read Web Scraping with Python and Regular Expression to get a better understanding of them.

Regular expressions are a powerful tool for identifying patterns within text, similar to the “Find” function in a word processor but with much greater capabilities. They are especially useful for validating user input and for web scraping.
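As a quick illustration of those two uses, here is a minimal sketch with Python's built-in re module. The sample sentence, the four-digit pattern, and the is_year helper are made up for this example and are not part of the scraper.

```python
import re

text = "Order 17 shipped in 2023; invoice 305 is due in 2024."

# "Find"-style search: collect every standalone run of exactly four digits.
years = re.findall(r"\b\d{4}\b", text)
print(years)  # ['2023', '2024']


def is_year(value):
    """Validation: fullmatch() succeeds only if the whole string fits the pattern."""
    return re.fullmatch(r"\d{4}", value) is not None


print(is_year("2023"), is_year("20233"))  # True False
```

findall() scans a larger text for every match, while fullmatch() checks a single value from start to end, which is why the first suits scraping and the second suits input validation.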

Now let’s write the code for the email scraper.

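from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import re

# Raw string so the backslashes in the Windows path are not treated as escapes
PATH = r'C:\Program Files (x86)\chromedriver.exe'

target_url = "https://www.randomlists.com/email-addresses"

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(PATH))

driver.get(target_url)

# Wait for the JavaScript-rendered content before reading the page source
time.sleep(5)

# Inside a character class "|" is a literal pipe, so [A-Za-z] is used for the TLD
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"
html = driver.page_source
emails = re.findall(email_pattern, html)

print(emails)
driver.close()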

We have added just three lines in the above code. Let me explain them step by step.

The email_pattern variable holds a regular expression that identifies emails on the web page. The expression is fairly straightforward, but let me walk through it.

  1. The first square bracket matches one or more characters that are uppercase or lowercase letters, digits, periods, underscores, percent signs, plus signs, or hyphens. It is followed by an “@” symbol.
  2. The second square bracket matches one or more characters that are uppercase or lowercase letters, digits, periods, or hyphens. It is followed by a literal “.” symbol.
  3. The last part, with the curly brackets, matches two to four uppercase or lowercase letters. A longer top-level domain such as .online is therefore cut off after four letters.
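The three parts described above can be written out one per line with re.VERBOSE, which lets a pattern carry comments. This is only a sketch: the sample addresses are made up, and the last character class is written as [A-Za-z], since a "|" inside square brackets would be matched as a literal pipe character.

```python
import re

EMAIL_RE = re.compile(
    r"""
    [A-Za-z0-9._%+-]+   # part 1: letters, digits, and . _ % + - before the @
    @                   # the literal at-sign
    [A-Za-z0-9.-]+      # part 2: letters, digits, periods, hyphens
    \.                  # the literal dot before the top-level domain
    [A-Za-z]{2,4}       # part 3: two to four letters
    """,
    re.VERBOSE,
)

sample = "Contact alice@example.com or bob.smith@mail.example.org; not-an-email@ or @nowhere."
found = EMAIL_RE.findall(sample)
print(found)  # ['alice@example.com', 'bob.smith@mail.example.org']

# The {2,4} limit truncates longer top-level domains.
print(EMAIL_RE.findall("carol@shop.online"))  # ['carol@shop.onli']
```

If longer top-level domains matter for your target site, widening the last quantifier (for example to {2,}) is one option, at the cost of a few more false positives.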

After this, we used driver.page_source to get the rendered HTML of the page. Then we used the re.findall() function to get every match of the pattern as a list of strings. The string is scanned from left to right, and the matches are returned in the order in which they were found.

Once you run the scraper script (emails.py), you will get output like this.
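Because the page is rendered by JavaScript, a fixed time.sleep() either waits longer than needed or not long enough. One alternative is to poll the page source until the pattern finds something. The sketch below is an assumption-laden illustration, not part of the original script: wait_for_emails and its parameters are hypothetical names, and a fake page source stands in for the browser so the example runs on its own.

```python
import re
import time

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")


def wait_for_emails(get_source, timeout=10.0, poll_interval=0.5):
    """Call get_source() repeatedly until it contains an email or time runs out.

    get_source is any zero-argument callable returning HTML as a string.
    Returns the list of matches, which is empty if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        emails = EMAIL_RE.findall(get_source())
        if emails or time.monotonic() >= deadline:
            return emails
        time.sleep(poll_interval)


# Stand-in for a page that finishes rendering on the third look.
snapshots = iter(["<p>loading</p>", "<p>loading</p>", "<p>hello@example.com</p>"])
print(wait_for_emails(lambda: next(snapshots), timeout=5, poll_interval=0.01))
```

With a real browser, the call would look like wait_for_emails(lambda: driver.page_source). Selenium also ships its own explicit-wait helper, WebDriverWait, which is worth a look if you want to wait for a specific element instead.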

['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
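A page can list the same address more than once, and findall() returns every occurrence. A small helper can drop the repeats while keeping the order in which addresses were found. This is a sketch: unique_emails is a hypothetical name, the sample list is made up, and lower-casing addresses before comparing them is an assumption that suits lead lists but may not suit every use.

```python
def unique_emails(emails):
    """Lower-case each address and drop repeats, keeping first-seen order."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(email.lower() for email in emails))


found = ["Ann@Example.com", "bob@example.org", "ann@example.com"]
print(unique_emails(found))  # ['ann@example.com', 'bob@example.org']
```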

Conclusion

In this tutorial, we learned how to apply regular expressions to find emails using Python. To scrape emails from a different website, you mainly have to change the target URL.
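When pointing the scraper at other sites, be aware that the pattern also matches strings that only look like emails, such as retina image names like logo@2x.png. The sketch below filters those out. The extract_emails name, the list of file suffixes, and the sample HTML are assumptions for illustration; adjust the suffix list to the pages you scrape.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")

# File endings that indicate an asset name rather than an address (assumed list).
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".css", ".js")


def extract_emails(html):
    """Return email-like matches from html, skipping asset file names."""
    candidates = EMAIL_RE.findall(html)
    return [c for c in candidates if not c.lower().endswith(ASSET_SUFFIXES)]


html = '<img src="logo@2x.png"> <a href="mailto:sales@example.com">Sales</a>'
print(extract_emails(html))  # ['sales@example.com']
```

In the scraper, this function could take driver.page_source as its argument in place of the direct re.findall() call.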

For collecting leads, Google is also a good source. You can collect emails from Google as well by making an appropriate query. At scale, though, you will need a web scraping API to extract data from Google, because it will block you in no time.

I also have a dedicated tutorial on scraping Google search results with Python. Check it out!


Manthan Koolwal

My name is Manthan Koolwal and I am the founder of scrapingdog.com. I love creating scrapers and seamless data pipelines.