How to Scrape Email Addresses from a Website using Python

13-01-2023

If you work in lead generation or marketing, you already know how important email addresses are. They are mainly used for cold emailing to acquire more clients for your product.

If you want to reduce the cost of customer acquisition, you can build your own email scraper rather than buying a subscription to an email finder tool.

How To Extract Email Address from Web Scraping

In this tutorial, we will build an email scraper using Python and regular expressions. Our target page for emails will be https://www.randomlists.com/email-addresses. Selenium will be used here because this website uses JavaScript to render its data.

Setting up the prerequisites

I am assuming that you have already installed Python 3.x on your machine. If not, you can download it from python.org. First, create a folder where we will keep our scraper files.

mkdir email_scraper

After this, install the necessary libraries.

pip install selenium
pip install beautifulsoup4

Along with this, you have to install the Chrome web driver (ChromeDriver) as well. Selenium uses it to drive the browser that renders the website. You can download it from here. With that, everything required for this article is installed.

Next, create the file where we will write our scraper. I am naming it emails.py.

Let’s Start Scraping Emails

Let’s first write a short script to check that everything works. At first, the browser might start a little slowly, but it will run normally after a while.

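from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import re

# Location of the ChromeDriver executable; the raw string keeps the backslashes literal
PATH = r'C:\Program Files (x86)\chromedriver.exe'

target_url = "https://www.randomlists.com/email-addresses"

# Recent Selenium 4 releases take the driver path through a Service object
driver = webdriver.Chrome(service=Service(PATH))

# Open the target page in the browser
driver.get(target_url)

# Give the JavaScript-rendered page time to finish loading
time.sleep(10)

driver.close()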

The code is simple and to the point. Let me explain it step by step.

  • We imported the libraries we need at the top.
  • We declared PATH, the location on disk where the Chrome web driver is installed.
  • We declared the target URL.
  • A Chrome instance is created using webdriver.Chrome().
  • Using the .get() method, we open the target URL in the browser.
  • Then the time.sleep() method is used to wait for the website to load. In this example, we wait 10 seconds for the page to render completely.
  • Finally, we close the browser using the .close() method.
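
A fixed time.sleep() either wastes time or turns out too short on a slow connection. One alternative is to poll until the page is ready. Selenium ships its own wait utilities for this; the helper below is a hand-rolled, hypothetical stand-in (wait_until and its parameters are names made up for this sketch) so that the idea can be tried without a browser.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)

# Stand-in for a page that finishes rendering on the third check
checks = iter([False, False, True])
print(wait_until(lambda: next(checks), timeout=2.0, interval=0.01))
```

With the driver from the script above, a call such as wait_until(lambda: "@" in driver.page_source) would return as soon as at least one "@" shows up in the rendered HTML, instead of always waiting the full 10 seconds.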

On a successful run, a Chrome window opens and loads the target page.

We will use regular expressions to identify the emails and scrape them. If you are new to regular expressions, read Web Scraping with Python and Regular Expression to get a better understanding of them.

Regular expressions are a powerful tool for identifying patterns within text, similar to the “Find” function in a word processor but with much greater capabilities. They are extremely useful for validating user input and, particularly, for web scraping.

Let’s write the code.

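from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import re

# Location of the ChromeDriver executable; the raw string keeps the backslashes literal
PATH = r'C:\Program Files (x86)\chromedriver.exe'

target_url = "https://www.randomlists.com/email-addresses"

# Recent Selenium 4 releases take the driver path through a Service object
driver = webdriver.Chrome(service=Service(PATH))

driver.get(target_url)

# Wait for the JavaScript-rendered content before reading the page source
time.sleep(5)

email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"
html = driver.page_source
emails = re.findall(email_pattern, html)

print(emails)
driver.close()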

We have added just three lines to the earlier code. Let me explain them step by step.

The email_pattern variable is a regular expression that helps us identify emails on the web page. The expression is fairly straightforward, so let’s go through it part by part.

  1. The first square bracket, with the plus sign after it, matches one or more characters that are uppercase or lowercase letters, digits, periods, underscores, percent signs, plus signs, or hyphens. It is followed by an “@” symbol.
  2. The second part matches one or more characters that are uppercase or lowercase letters, digits, periods, or hyphens, followed by a “.” symbol.
  3. The last part, with the curly brackets, matches two to four uppercase or lowercase letters.
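
To see the three parts at work, here is a small sketch that checks a few strings against the same pattern. The sample strings and the is_email helper are made up for illustration. Note that re.fullmatch is used here so that the whole string has to match, whereas the scraper uses findall to pick addresses out of a larger text.

```python
import re

# Same pattern as in the scraper script
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"

# Hypothetical sample strings, made up for illustration
candidates = [
    "jane.doe@example.com",       # ordinary address
    "sales+eu@mail.example.org",  # plus sign and a subdomain
    "no-at-sign.example.com",     # missing "@"
    "user@example",               # missing the dot and final letters
    "user@example.c",             # final part shorter than two letters
]

def is_email(text):
    """Return True when the whole string matches the pattern."""
    return re.fullmatch(EMAIL_PATTERN, text) is not None

results = {text: is_email(text) for text in candidates}
for text, ok in results.items():
    print(f"{text!r}: {ok}")
```

The first two strings match and the last three do not. Because the final part is limited to four letters, an address with a longer top-level domain would not fully match either, which is worth keeping in mind if you reuse the pattern elsewhere.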

After this, we used driver.page_source to get the raw HTML of the rendered page. Then we used the findall() method to get all the matches in the string as a list of strings. The string is scanned from left to right, and the matches are returned in the order in which they were found.

Once you run this code, you will get output like this.
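
Since findall() returns every match, the same address can appear several times, for example once in a mailto: link and once in the visible text. The sketch below shows this on a made-up HTML snippet (sample_html is an assumption standing in for driver.page_source) and removes duplicates while keeping the order of first appearance.

```python
import re

EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"

# Hypothetical markup standing in for driver.page_source
sample_html = """
<ul>
  <li><a href="mailto:bob@example.com">bob@example.com</a></li>
  <li>alice@example.org</li>
  <li><a href="mailto:bob@example.com">Write to Bob</a></li>
</ul>
"""

# Matches come back left to right, duplicates included
found = re.findall(EMAIL_PATTERN, sample_html)

# dict.fromkeys keeps the first occurrence of each address and preserves order
unique = list(dict.fromkeys(found))

print(found)
print(unique)
```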

['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']

Conclusion

In this tutorial, we learned how to apply regular expressions to find emails using Python. With just a few more changes you can scrape emails from other websites, starting with the target URL.

Google is also a good source for collecting leads. You can collect emails from Google as well by making an appropriate query. Of course, you will need a web scraping API to scrape Google at scale, as it will block you in no time.

I also have a dedicated tutorial on Scraping Google with Python here. Do check it out!
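
On other websites the same pattern can also pick up things that are not addresses, such as image file names like logo@2x.png. As a rough sketch of the kind of extra change this calls for, the function below lowercases the matches, drops ones that end in a file extension, and removes duplicates. The name extract_emails, the ASSET_SUFFIXES list, and the sample HTML are assumptions made up for this example, not part of the original script.

```python
import re

EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"

# Assumed list of file extensions that signal an asset name, not an address
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".css", ".js")

def extract_emails(html):
    """Return unique, lowercased addresses from html in order of first appearance."""
    seen = {}
    for match in re.findall(EMAIL_PATTERN, html):
        email = match.lower()
        if email.endswith(ASSET_SUFFIXES):
            continue  # skip matches that look like file names
        seen.setdefault(email, None)
    return list(seen)

# Hypothetical page content, made up for illustration
sample_html = '<img src="logo@2x.png"> Contact: Info@Example.com or info@example.com'
print(extract_emails(sample_html))
```

In the scraper itself, the call would be extract_emails(driver.page_source) in place of the bare re.findall().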

Manthan Koolwal

My name is Manthan Koolwal and I am the founder of scrapingdog.com. I love creating scrapers and seamless data pipelines.