Proxies have become a crucial tool in today's digital world. A proxy is like a private security team for your business: it lets you route traffic around protection walls set up by governments or private entities and look for hidden or censored data while staying anonymous.
In this article, we will talk about how proxies can be used with Python and the Requests package to scrape the web. But before we start with that, let's look at some applications of using a proxy pool.
Applications of Proxies
- Web Scraping – You can send every request to a website from a different IP. This keeps you anonymous and lets you gather information without getting blocked.
- Load Balancing – Proxies can help you distribute the client load across multiple servers at the backend.
- Geo-Fencing – You can access information that is restricted to a certain area or country.
- Security – Proxies can act as a barrier between the server and the client, filtering incoming requests much like Cloudflare does.
- Anonymity – Proxies can help you hide your real IP, which makes it harder for websites to track your activity.
Setting up the prerequisites
We are going to use Python 3.x for this article. I hope you have already installed it on your machine; if not, you can download it from here.
Then create a dedicated folder for this tutorial, and inside it create a Python file with any name you like. I am naming the file proxy.py.
mkdir tutorial
Now, install the requests package with this command inside the tutorial folder.
pip install requests
How to Use a Proxy with Python Requests?
The first step would be to import the requests library inside the proxy.py file.
import requests
The next step would be to create a proxy dictionary containing two key-value pairs, with the keys 'http' and 'https'. Each represents a communication protocol (HTTP and HTTPS) and points to its respective proxy server URL.
proxies = {
    'http': 'http://proxy.example.com:8081',
    'https': 'http://proxy.example.com:8081',
}
Currently the values of http and https are the same, but they could be different as well. You can use a different proxy URL for each protocol depending on the website you are going to handle with your scraper.
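For instance, if you wanted to route plain-HTTP traffic and HTTPS traffic through different servers, the dictionary could look like the sketch below (both hostnames are placeholders, not real proxies):

proxies = {
    'http': 'http://http-proxy.example.com:8080',    # proxy used for plain-HTTP targets
    'https': 'http://https-proxy.example.com:8443',  # proxy used for HTTPS targets
}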
The third and last step would be to make an HTTP GET request to the target website using requests.
response = requests.get('https://books.toscrape.com/', proxies=proxies)
You can of course make many other kinds of HTTP requests, such as .post(), .delete(), or .put(), through the same proxy, as in the sketch below.
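As a rough illustration (the proxy URL is a placeholder and httpbin.org is used here only as a convenient test endpoint), the other methods accept the same proxies argument:

import requests

proxies = {
    'http': 'http://proxy.example.com:8081',
    'https': 'http://proxy.example.com:8081',
}

# Send form data through the same proxy with a POST request
response = requests.post('https://httpbin.org/post', data={'name': 'value'}, proxies=proxies)
print(response.status_code)

# PUT and DELETE work the same way
requests.put('https://httpbin.org/put', data={'name': 'new-value'}, proxies=proxies)
requests.delete('https://httpbin.org/delete', proxies=proxies)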
Generally, proxies other than public ones (which I would suggest you avoid anyway) require authentication. Let's see how to deal with those kinds of proxies.
How to Authenticate when Using a Proxy Server with Python Requests
When you buy a proxy online, it usually requires a username and password. You can provide the credentials using either basic authentication or an authenticated proxy URL.
You can provide the full proxy URL, including the authentication details, in the proxies dictionary.
import requests

# Replace with your authenticated proxy URL
authenticated_proxy_url = 'http://username:password@proxy.example.com:8081'

proxies = {
    'http': authenticated_proxy_url,
    'https': authenticated_proxy_url,
}

response = requests.get('https://www.example.com', proxies=proxies)
print(response.text)
But always remember that using proxy authentication credentials directly in your code is not recommended, especially in production environments or when sharing your code with others, as it poses a security risk. Instead, you should use environment variables or configuration files to store sensitive information securely.
This is how you can set the proxy environment variables in your shell.
export HTTP_PROXY=http://proxy.example.com:8081
export HTTPS_PROXY=http://proxy.example.com:8081
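Note that requests picks up HTTP_PROXY and HTTPS_PROXY from the environment automatically. If you prefer to keep the credentials out of your source and build the proxies dictionary yourself, a minimal sketch could look like this (the variable names PROXY_USER, PROXY_PASS, and PROXY_HOST are my own choice for this example, not a standard):

import os
import requests

# Read proxy credentials from environment variables instead of hardcoding them
# (PROXY_USER, PROXY_PASS and PROXY_HOST are hypothetical names used for this sketch)
user = os.environ['PROXY_USER']
password = os.environ['PROXY_PASS']
host = os.environ.get('PROXY_HOST', 'proxy.example.com:8081')

proxy_url = f'http://{user}:{password}@{host}'
proxies = {'http': proxy_url, 'https': proxy_url}

response = requests.get('https://www.example.com', proxies=proxies)
print(response.status_code)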
Handling Proxy Errors and Exceptions
For a smooth flow of your code, you should always handle the proxy errors that can occur, especially when making large numbers of concurrent requests. Unhandled errors will break your pipeline. These errors can be caused by incorrect proxy configurations, network issues, or server unavailability.
Here are some common proxy-related errors and how to handle them in Python:
Proxy Connection Errors
requests.exceptions.ProxyError – This exception is raised when there is an issue connecting to the proxy server. It could be due to the proxy server being down or unavailable.
import requests

proxy = 'http://username:password@proxy.example.com:8081'
target_url = 'https://www.example.com'

try:
    response = requests.get(target_url, proxies={'http': proxy, 'https': proxy})
    response.raise_for_status()  # Raise an exception for HTTP error responses
    print(response.text)
except requests.exceptions.ProxyError as e:
    print(f"Proxy connection error: {e}")
except requests.exceptions.RequestException as e:
    print(f"Request error: {e}")
Proxy Authentication Errors
requests.exceptions.ProxyError – This exception can also be raised if there are issues with proxy authentication. For authenticated proxies, make sure you provide the correct username and password.
import requests

proxy = 'http://username:password@proxy.example.com:8081'
target_url = 'https://www.example.com'

try:
    response = requests.get(target_url, proxies={'http': proxy, 'https': proxy})
    response.raise_for_status()
    print(response.text)
except requests.exceptions.ProxyError as e:
    print(f"Proxy connection error: {e}")
except requests.exceptions.RequestException as e:
    print(f"Request error: {e}")
Timeout Errors
requests.exceptions.Timeout – This exception occurs when the request to the proxy or the target server times out. You can pass a timeout argument to requests.get() or requests.post() to handle timeout errors.
import requests

proxy = 'http://username:password@proxy.example.com:8081'
target_url = 'https://www.example.com'

try:
    response = requests.get(target_url, proxies={'http': proxy, 'https': proxy}, timeout=10)
    response.raise_for_status()
    print(response.text)
except requests.exceptions.Timeout as e:
    print(f"Timeout error: {e}")
except requests.exceptions.RequestException as e:
    print(f"Request error: {e}")
Handling General Exceptions
Always use a broad except block to catch general exceptions (Exception) in case there are other unexpected errors not covered by the specific exception types.
import requests

proxy = 'http://username:password@proxy.example.com:8081'
target_url = 'https://www.example.com'

try:
    response = requests.get(target_url, proxies={'http': proxy, 'https': proxy}, timeout=10)
    response.raise_for_status()
    print(response.text)
except Exception as e:
    print(f"An unexpected error occurred: {e}")
This way you can make your code more robust and resilient when using proxies with Python requests. Additionally, you can log or display meaningful error messages to help with debugging and troubleshooting.
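For instance, a small retry wrapper like the sketch below (the retry count and delay are arbitrary choices, not recommendations) can turn a transient proxy failure into a logged warning instead of a crash:

import time
import requests

def fetch_with_retries(url, proxies, retries=3, delay=2):
    # Try the request a few times before giving up, logging each failure
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, proxies=proxies, timeout=10)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt} failed: {e}")
            time.sleep(delay)
    return None

proxies = {'http': 'http://proxy.example.com:8081', 'https': 'http://proxy.example.com:8081'}
result = fetch_with_retries('https://www.example.com', proxies)
if result is not None:
    print(result.text)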
Rotating Proxies with Requests
Well, many of you might not be aware of rotating proxies, so let me explain this in very simple language. Using rotating proxies is like having a group of friends who can open the door of any specific website for you. This group could number in the millions. This way you will never get blocked, because you have a new friend on every visit.
In technical terms, these friends are IPs from different locations around the globe. While scraping any website it is always advised to use a different IP on every request, because many websites sit behind anti-bot services like Cloudflare that block large volumes of requests from a single IP.
Of course, just changing IPs will not bypass such an anti-scraping wall on its own, but not changing IPs could definitely get your data pipeline blocked.
Let's now write a small Python script for rotating proxies with requests. We will scrape a sample website with a new IP on every request.
import requests
import random

proxy_list = ['http://50.169.175.234:80', 'http://50.204.219.228:80', 'http://50.238.154.98:80']

scraping_url = input('Enter a url to scrape.\n')
print('We will now scrape', scraping_url)

proxies = {
    'http': random.choice(proxy_list),
    'https': random.choice(proxy_list),
}

try:
    response = requests.get(scraping_url, proxies=proxies)
    print(response.text)
except requests.exceptions.ProxyError as e:
    print(f"Proxy connection error: {e}")
except requests.exceptions.RequestException as e:
    print(f"Request error: {e}")
Let me break down this code for you.
- import requests: This line imports the requests library, which is used to make HTTP requests and handle responses.
- import random: This line imports the random module, which will be used to select a random proxy from the proxy_list.
- proxy_list: This is a list that contains several proxy server URLs. Each URL represents a proxy through which the web request will be sent.
- scraping_url = input('Enter a url to scrape.\n'): This line prompts the user to enter a URL to be scraped and stores the input.
- print('We will now scrape', scraping_url): This line prints the provided URL to indicate that the scraping process is starting.
- proxies = {'http': random.choice(proxy_list), 'https': random.choice(proxy_list)}: This creates a dictionary called proxies, which is passed to the requests.get() function. The random.choice() function selects a random proxy from the proxy_list for both HTTP and HTTPS requests.
- The try block: This block makes the HTTP request to the provided URL using the randomly selected proxy.
- response = requests.get(scraping_url, proxies=proxies): This line sends an HTTP GET request to the scraping_url using the requests.get() function. The proxies parameter passes the randomly selected proxy for the request.
- print(response.text): If the request is successful, the response content (the HTML of the webpage) is printed to the console.
- The except blocks: These handle exceptions that might occur during the request.
- except requests.exceptions.ProxyError as e:: If there is an issue with the selected proxy, this exception is caught and the error message is printed.
- except requests.exceptions.RequestException as e:: This is a general exception for any other request-related errors, such as connection errors or timeouts. If such an error occurs, the error message is printed.
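If you want to send several requests and rotate through the pool so that each one uses a different proxy (rather than picking a single random proxy per run, as above), a small variation using itertools.cycle could look like the sketch below. It reuses the same public proxy_list, so treat it as an illustration rather than production code:

import requests
from itertools import cycle

proxy_list = ['http://50.169.175.234:80', 'http://50.204.219.228:80', 'http://50.238.154.98:80']
proxy_pool = cycle(proxy_list)  # endlessly loop over the proxies in order

# The same page is fetched three times here purely to demonstrate the rotation
urls = ['https://books.toscrape.com/'] * 3

for url in urls:
    proxy = next(proxy_pool)
    proxies = {'http': proxy, 'https': proxy}
    try:
        response = requests.get(url, proxies=proxies, timeout=10)
        print(proxy, response.status_code)
    except requests.exceptions.RequestException as e:
        print(f"Request through {proxy} failed: {e}")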
But these snippets have a problem: we have used public proxies, and they are already blocked by many websites. So, here we will use something that is private and free too.
Using Scrapingdog Rotating Proxies with Requests to scrape websites
Scrapingdog provides a generous 1,000 free credits. You can sign up for a free account from here.
Once you sign up, you will see an API key on your dashboard. You can use that API key in the code below.
import requests

proxies = {
    'http': 'http://scrapingdog:Your-API-key@proxy.scrapingdog.com:8081',
    'https': 'http://scrapingdog:Your-API-key@proxy.scrapingdog.com:8081',
}

scraping_url = input('Enter a url to scrape.\n')  # url should be http only
print('We will now scrape', scraping_url)

try:
    response = requests.get(scraping_url, proxies=proxies, verify=False)
    print(response.text)
except requests.exceptions.ProxyError as e:
    print(f"Proxy connection error: {e}")
except requests.exceptions.RequestException as e:
    print(f"Request error: {e}")
In place of Your-API-key, paste your own API key. Scrapingdog has a pool of more than 15M proxies with which you can scrape almost any website. Scrapingdog not only rotates IPs, it also handles headers and retries for you. This way you always get the data you want in just a single hit.
Remember to use http URLs instead of https while scraping through this proxy.
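If you want to enforce that in code, a tiny check like the one below (my own addition, not something from Scrapingdog's documentation) rewrites the scheme before the request is made:

scraping_url = input('Enter a url to scrape.\n')

# Downgrade https:// to http:// so the request goes through the proxy as advised above
if scraping_url.startswith('https://'):
    scraping_url = 'http://' + scraping_url[len('https://'):]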
Conclusion
Proxies have many applications, as discussed above, and web scraping is one of them. Proxies also come in different types; if you want to learn more about them, you should read best datacenter proxies.
The quality of proxies always matters when it comes to web scraping or internet browsing. To be honest, there are tons of options in the market when it comes to rotating proxies, but only a few of them work.
Before we wrap up, I would advise you to read web scraping with Python to get in-depth knowledge of web scraping. That article is for everyone, from beginner to advanced, and covers everything from data downloading to data parsing. Check it out!
I hope you like this little tutorial and if you do then please do not forget to share it with your friends and on your social media.
Additional Resources
Here are a few additional resources that you may find helpful during your web scraping journey: