Email scraping has become a popular and efficient method for obtaining contact information from the internet. By learning how to scrape emails, businesses and individuals can expand their networks, gather leads, and conduct market research more effectively. In this article, we will look at how to extract email addresses from websites using Python.
In this tutorial, we will build an email scraper using Python, Selenium, and regular expressions. Our target website for emails will be this webpage. Selenium is used here because this website renders its data with JavaScript.
Setting up the prerequisites
I am assuming that you have already installed Python 3.x on your machine. If not, you can download it from here. First, create a folder where we will keep our scraper files.
mkdir email_scraper
After this, you have to install the necessary libraries and web drivers.
pip install selenium
pip install beautifulsoup4
Along with this, you have to install the Chrome web driver (ChromeDriver) as well. Selenium uses it to control the browser and render websites. You can download it from here. With that, everything required for this article is installed.
Next, create the file where we will write our scraper. I am naming it emails.py.
Let’s Start Scraping Emails
Let’s first write a small script to check that everything works. At first, the Chrome browser might run a little slowly, but it will work normally after a while.
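from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import re

# Raw string so the backslashes in the Windows path are kept literally
PATH = r'C:\Program Files (x86)\chromedriver.exe'

target_url = "https://www.randomlists.com/email-addresses"

# Selenium 4 receives the driver path through a Service object
driver = webdriver.Chrome(service=Service(executable_path=PATH))
driver.get(target_url)

# Give the JavaScript-rendered page time to load
time.sleep(10)
driver.close()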
- We imported the libraries we need: webdriver from Selenium, plus time and re from the Python standard library.
- We declared PATH, the location where the Chrome driver is installed.
- We declared the target URL.
- A Chrome instance is created using webdriver.Chrome().
- The .get() method opens the target URL in the browser.
- The time.sleep() method waits for the website to load. In this example, we wait 10 seconds for the page to render completely.
- Finally, we closed the browser using the .close() method.
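A fixed 10-second sleep either wastes time or is too short on a slow connection. As a rough sketch of an alternative, the helper below polls for the rendered content and returns as soon as it appears. The function name wait_for_text and its parameters are my own for illustration and are not part of Selenium, which ships its own explicit-wait helper, WebDriverWait. The page is simulated here with a list of snapshots so the example runs without a browser.

```python
import time


def wait_for_text(get_source, needle, timeout=10.0, interval=0.5):
    """Poll get_source() until needle appears or timeout seconds pass.

    Returns the source that contained needle, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = get_source()
        if needle in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{needle!r} not found within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page that finishes rendering on the third poll.
# With Selenium you would pass `lambda: driver.page_source` instead.
snapshots = iter(["<html></html>", "<html>loading</html>", "<html>a@b.com</html>"])
html = wait_for_text(lambda: next(snapshots), "@", timeout=5, interval=0.01)
print(html)
```

Waiting for an "@" character is only a heuristic for "the email list has rendered"; on another site you would pick a string that is specific to the content you are after.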
We will use regular expressions to identify the emails and scrape them. If you are new to regular expressions, read Web Scraping with Python and Regular Expression to get a better understanding of them.
Regular expressions are a powerful tool for identifying patterns within the text, similar to using the “Find” function in a word processing document, but with much greater capabilities. Regular expressions are extremely useful for validating user input and, particularly, for web scraping. They have a wide range of applications.
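To make both uses concrete, here is a small sketch with Python's built-in re module. The sample text and the deliberately simple pattern are made up for illustration; the scraper itself uses a stricter pattern.

```python
import re

text = "Contact sales@example.com or support@example.org for help."

# \b marks a word boundary, \w+ matches one or more word characters.
simple_pattern = r"\b\w+@\w+\.\w+\b"

# Scraping: find every match inside a larger piece of text.
found = re.findall(simple_pattern, text)
print(found)  # ['sales@example.com', 'support@example.org']

# Validation: check that a whole string fits the pattern.
is_valid = re.fullmatch(simple_pattern, "sales@example.com") is not None
print(is_valid)  # True
```

re.findall() returns every non-overlapping match as a list, while re.fullmatch() succeeds only when the entire string matches, which is why it suits input validation.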
Let’s write the code.
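from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import re

# Raw string so the backslashes in the Windows path are kept literally
PATH = r'C:\Program Files (x86)\chromedriver.exe'

target_url = "https://www.randomlists.com/email-addresses"

# Selenium 4 receives the driver path through a Service object
driver = webdriver.Chrome(service=Service(executable_path=PATH))
driver.get(target_url)

# Wait for the page to render before reading its HTML
time.sleep(5)

# Local part, "@", domain, ".", then a top-level domain of 2 to 4 letters
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"
html = driver.page_source
emails = re.findall(email_pattern, html)

print(emails)
driver.close()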
- The first bracketed set, [A-Za-z0-9._%+-]+, matches one or more characters that are uppercase or lowercase letters, digits, periods, underscores, percent signs, plus signs, or hyphens. It is followed by an “@” symbol.
- The second set, [A-Za-z0-9.-]+, matches one or more characters that are uppercase or lowercase letters, digits, periods, or hyphens. It is followed by a literal “.” symbol, written as \. in the pattern.
- The last part, ending in {2,4}, matches two to four uppercase or lowercase letters. Because of this limit, a longer top-level domain such as .museum is only matched up to its first four letters.
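The three parts described above can be laid out one per line with re.VERBOSE, which lets a pattern carry whitespace and comments. This is only a sketch: the group names local, domain, and tld and the sample link are my own additions for illustration, and the final set is written as [A-Za-z].

```python
import re

EMAIL_PATTERN = re.compile(
    r"""
    (?P<local>[A-Za-z0-9._%+-]+)   # part before the at sign
    @
    (?P<domain>[A-Za-z0-9.-]+)     # domain: letters, digits, periods, hyphens
    \.
    (?P<tld>[A-Za-z]{2,4})         # top-level domain: two to four letters
    """,
    re.VERBOSE,
)

sample = '<a href="mailto:jane.doe@mail.example.com">Jane</a>'
match = EMAIL_PATTERN.search(sample)

print(match.group(0))         # jane.doe@mail.example.com
print(match.group("local"))   # jane.doe
print(match.group("domain"))  # mail.example
print(match.group("tld"))     # com
```

Note how the domain group first grabs as much as it can and then gives back ".com" so that the rest of the pattern can still match.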
Conclusion
In this tutorial, we learned how to apply regular expressions to find emails using Python. To scrape emails from another website, you mostly just have to change the target URL, although some sites may need a few more adjustments.
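When you point the scraper at other sites, the raw matches tend to contain duplicates and the occasional false positive. The sketch below wraps the matching step in a reusable function that you could feed with driver.page_source. The function name extract_emails, the ASSET_EXTENSIONS list, and the sample page are hypothetical, and lowercasing addresses is an assumption that suits lead lists rather than a rule of the email format.

```python
import re

EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")

# Hypothetical filter list: file extensions that make a match look like an
# email address, for example retina image names such as "logo@2x.png".
ASSET_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")


def extract_emails(html):
    """Return unique, lowercased email addresses in order of first appearance."""
    seen = set()
    emails = []
    for candidate in EMAIL_PATTERN.findall(html):
        email = candidate.lower()
        # Skip image file names and addresses we have already collected.
        if email.endswith(ASSET_EXTENSIONS) or email in seen:
            continue
        seen.add(email)
        emails.append(email)
    return emails


page = """
<p>Write to Info@Example.com or info@example.com.</p>
<img src="/static/logo@2x.png">
<a href="mailto:sales@example.org">Sales</a>
"""
print(extract_emails(page))  # ['info@example.com', 'sales@example.org']
```

Keeping the function free of any browser code means it works the same whether the HTML comes from Selenium or from a saved file.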
For collecting leads, Google is also a good source. You can collect emails from Google as well by making an appropriate query. To extract data from Google at scale, you will need a web scraping API, because Google blocks automated requests very quickly.
I have a dedicated tutorial on scraping Google search results with Python here. Check it out!