10 Tips For Web Scraping Without Getting Blocked or Blacklisted


Are you tired of getting blocked while web scraping?

Or do you expect to be blocked by the next website you want to scrape, and find yourself asking whether you can crawl websites at all without getting blocked?

If so, you are probably looking for a solution.

Well, yes, you can!

Web scraping comes with more challenges than you might imagine, though, so it has to be done responsibly.

You have to be careful with the website you are scraping, because careless scraping can have negative effects on it, such as overloading its servers.

There are free web scraping tools on the market that can scrape many websites without getting blocked. Many websites do not have any anti-scraping mechanism, but some do block scrapers because they do not believe in open data access.

One thing you have to keep in mind is to BE NICE and FOLLOW the SCRAPING POLICIES of the website.

Prevent Getting Blacklisted While Web Scraping with These Tips

1. ROBOTS.TXT

First of all, you have to understand what a robots.txt file is and what it does. Basically, it tells search engine crawlers which pages or files they can or can’t request from a site.

It is used mainly to avoid overloading a website with requests, and it provides standard rules about scraping. Many websites allow Google to scrape them.

You can find the file at the root of a website, for example http://example.com/robots.txt. Some websites have User-agent: * followed by Disallow: / in their robots.txt file, which means they don’t want any bot to scrape them.
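To make the robots.txt check concrete, here is a minimal sketch using Python’s standard urllib.robotparser module. The robots.txt content, the paths, and the "my-scraper" user agent name are made up for illustration, and the rules are inlined so the example runs without touching the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer can_fetch() queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is a placeholder user agent name.
print(parser.can_fetch("my-scraper", "http://example.com/index.html"))
print(parser.can_fetch("my-scraper", "http://example.com/private/data"))
print(parser.crawl_delay("my-scraper"))
```

Against a live site you would instead call parser.set_url("http://example.com/robots.txt") followed by parser.read(), and then ask can_fetch() before every request.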

Basically, an anti-scraping mechanism works on one fundamental question: is it a bot or a human? To decide, it checks certain criteria. Points an anti-scraping mechanism looks at:

If you are scraping pages faster than a human possibly can, you will fall into a category called “bots”.
If you follow the same pattern while scraping. For example, you go through every page of the target domain collecting only images or links.
If you are scraping from the same IP for a long period of time.
If the User-Agent is missing or unusual. This happens when you use a bare HTTP library or a headless browser without setting it yourself.

If you keep these points in mind while scraping, you will be able to scrape most websites on the web.

2. IP Rotation

Sending every request from the same IP is the easiest way for anti-scraping mechanisms to catch you red-handed. If you keep using the same IP, you will be blocked after some time. So spread your requests over different IPs, ideally using a different one for each new request.

You should have a pool of at least 10 IPs before you start making HTTP requests. To avoid getting blocked you can use rotating proxy services like Scrapingdog or any other proxy service.

Here is a small Python snippet that builds a pool of proxy IP addresses before making requests.

from bs4 import BeautifulSoup
import requests

# Example value (an assumption); use the path segment for the country you want.
country_code = "country-us"
url = "https://www.proxynova.com/proxy-server-list/" + country_code + "/"


def cell_text(cells, index):
    """Return the cleaned text of a table cell, or None if the cell is missing."""
    try:
        return cells[index].text.replace("\n", "").replace(" ", "")
    except IndexError:
        return None


response = requests.get(url, timeout=10).text
soup = BeautifulSoup(response, "html.parser")

proxies = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    ip = cell_text(cells, 0)
    if ip is not None:
        # The IP cell may be wrapped in a document.write(...) call; strip the wrapper.
        for junk in ("document.write(", ")", "'", ";"):
            ip = ip.replace(junk, "")
    proxy = {"ip": ip, "port": cell_text(cells, 1), "country": cell_text(cells, 5)}
    # Rows without <td> cells (such as the header row) have no port and are skipped.
    if proxy["port"] is not None:
        proxies.append(proxy)

print(proxies)

This prints a list of dictionaries, each with three keys: ip, port, and country. The proxy list is filtered by country code; you can find the country code here.

For websites that have advanced bot detection mechanisms, you have to use either mobile or residential proxies. You can again use Scrapingdog for such services.

The number of IPs in the world is fixed, but by using these services you get access to millions of them, which can be used to scrape millions of pages. This is the best thing you can do to scrape successfully over a longer period of time.
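Once you have a pool, rotating through it could look roughly like the sketch below. It uses only the standard library so that it runs without extra packages; the same idea works with requests through its proxies argument. The addresses in PROXY_POOL are placeholders from a documentation-only IP range, and the function names are my own.

```python
import itertools
import urllib.request

# Hypothetical proxy pool; real entries would come from a proxy list or a provider.
PROXY_POOL = [
    {"ip": "203.0.113.10", "port": "8080"},
    {"ip": "203.0.113.11", "port": "3128"},
    {"ip": "203.0.113.12", "port": "8000"},
]


def proxy_url(proxy: dict) -> str:
    """Build the proxy address in the form urllib expects."""
    return f"http://{proxy['ip']}:{proxy['port']}"


def make_opener(proxy: dict) -> urllib.request.OpenerDirector:
    """Create an opener that routes HTTP and HTTPS traffic through one proxy."""
    address = proxy_url(proxy)
    handler = urllib.request.ProxyHandler({"http": address, "https": address})
    return urllib.request.build_opener(handler)


# cycle() hands out the proxies round-robin, starting over when the pool is exhausted.
rotation = itertools.cycle(PROXY_POOL)


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Fetch a URL through the next proxy in the rotation."""
    opener = make_opener(next(rotation))
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Each call to fetch() goes out through the next proxy in the pool. A real scraper would also drop proxies that time out or start returning errors.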

3. User-Agent

The User-Agent request header is a character string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent. Some websites block requests whose User-Agent doesn’t belong to a major browser.

If no user agent is set, many websites won’t let you view their content. You can find your own user agent by searching for “what is my user agent” on Google.

You can also check your user agent string here:
http://www.whatsmyuseragent.com/

Anti-scraping mechanisms treat user agents much like they treat IPs. If you use the same user agent for every request, you will be banned in no time.

So, what is the solution?

Well, the solution is pretty simple: either create your own list of User-Agents or use a library like fake-useragent. I have used both techniques, but for efficiency I would urge you to use the library.

A user-agent string listing to get you started can be found here:
http://www.useragentstring.com/pages/useragentstring.php
https://developers.whatismybrowser.com/useragents/explore/
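If you go with your own list, rotating it can be as simple as the sketch below. The three strings are only examples of what browser user agents look like, and build_request is a helper name I made up; nothing is sent over the network here.

```python
import random
import urllib.request

# Example user agent strings (illustrative; keep your own list up to date).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def build_request(url: str) -> urllib.request.Request:
    """Create a request whose User-Agent is picked at random from the list."""
    return urllib.request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})


request = build_request("http://example.com/")
# urllib normalizes header names, so the key is read back as "User-agent".
print(request.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen() would then send it with the chosen user agent.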

4. Make Web scraping slower, keep Random Intervals in between

As you know, humans and bots crawl websites at very different speeds. Bots can scrape websites at a pace no human can match. Making fast, unnecessary, or random requests to a website is not good for anyone, and this overloading of requests may even bring a website down.

To avoid this mistake, make your bot sleep programmatically between requests. This will make your bot look more human to the anti-scraping mechanism.

It will also not harm the website. Scrape a small number of pages at a time and limit concurrent requests. Pause for around 10 to 20 seconds, then continue scraping.
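A random pause between requests could be sketched like this. The 10 to 20 second defaults come from the advice above; the function names and the injectable sleep and fetch parameters are my own choices, there so the logic can be tried out without actually waiting or hitting a website.

```python
import random
import time


def polite_pause(min_seconds=10.0, max_seconds=20.0, sleep=time.sleep) -> float:
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def crawl(urls, fetch, min_seconds=10.0, max_seconds=20.0, sleep=time.sleep):
    """Fetch the URLs one at a time, pausing a random interval between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, only between consecutive ones.
            polite_pause(min_seconds, max_seconds, sleep)
        results.append(fetch(url))
    return results
```

Because the delay is random, the gaps between requests do not form a fixed rhythm that is easy to spot in the logs.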

As I said earlier, you have to respect the robots.txt file.

Use auto-throttling mechanisms, which automatically adjust the crawling speed based on the load on both the spider and the website you are crawling. Tune the spider to an optimum crawling speed after a few trial runs.

Do this periodically, because the environment changes over time.
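As one example of auto throttling, the Scrapy framework ships an AutoThrottle extension that is switched on in the project settings. Scrapy is my assumption here, since no framework is named above, and the numbers are only starting points to tune during your trial runs.

```python
# settings.py of a Scrapy project (assumed framework; values are illustrative).

# Let Scrapy adapt the delay to the latency it observes from the site.
AUTOTHROTTLE_ENABLED = True
# Initial delay in seconds before any latency has been measured.
AUTOTHROTTLE_START_DELAY = 5
# Upper bound for the delay in seconds when the site responds slowly.
AUTOTHROTTLE_MAX_DELAY = 60
# Average number of requests to send in parallel to each remote server.
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Log throttling stats for every response, which helps while tuning.
AUTOTHROTTLE_DEBUG = True
# Respect robots.txt rules.
ROBOTSTXT_OBEY = True
```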

5. Change in Scraping Pattern & Detect website change

Generally, humans don’t perform repetitive tasks; they browse a site with random actions. Web scraping bots, on the other hand, crawl in the same pattern every time because they are programmed to do so.

As I said earlier, some websites have great anti-scraping mechanisms. They will catch your bot and ban it permanently.

Now, how can you protect your bot from being caught? By incorporating random clicks, mouse movements, and other random actions that make a spider look like a human.

Another problem is that many websites change their layouts for various reasons, and when that happens your scraper will fail to bring back the data you expect.

For this, you should have a monitoring system that detects layout changes and alerts you. You can then update your scraper accordingly.

One of my friends works at a large online travel agency, and they scrape the web to get their competitors’ prices. They have a monitoring system that mails them every 15 minutes about the status of the layouts.

This keeps everything on track and their scraper never breaks.
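A very small layout monitor could check that the ids and classes your scraper depends on are still present in the page. The sketch below is one possible approach using only the standard library; the HTML snippets and the marker names ("results", "price") are invented for the example.

```python
from html.parser import HTMLParser


class MarkerCollector(HTMLParser):
    """Collect every id and class name that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name == "id":
                self.ids.add(value)
            elif name == "class":
                self.classes.update(value.split())


def missing_markers(html, expected_ids=(), expected_classes=()):
    """Return the expected ids and classes that are no longer in the page."""
    collector = MarkerCollector()
    collector.feed(html)
    missing = [f"#{name}" for name in expected_ids if name not in collector.ids]
    missing += [f".{name}" for name in expected_classes if name not in collector.classes]
    return missing


# Hypothetical page before and after a redesign.
OLD_PAGE = '<div id="results"><span class="price big">$120</span></div>'
NEW_PAGE = '<div id="listing"><span class="amount">$120</span></div>'

print(missing_markers(OLD_PAGE, ["results"], ["price"]))
print(missing_markers(NEW_PAGE, ["results"], ["price"]))
```

Run on a schedule, a non-empty result is the signal to send yourself an alert before the scraper starts returning bad data.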

6. Headers

When you make a request to a website from your browser, it sends a list of headers, and the website uses them to analyze who you are. To make your scraper look more human, send the same headers your browser sends. Copy them from the network tab of your browser’s developer tools and paste them into the headers object in your code. That will make your request look like it’s coming from a real browser.

On top of that, IP and User-Agent rotation will make your scraper much harder to block, whether the website is dynamic or static. I am pretty sure that with these techniques you will be able to beat most anti-scraping mechanisms.

Now, there is a header called “Referer”. It is an HTTP request header that lets the site know which site you are arriving from.

Generally, it’s a good idea to set this so that it looks like you’re arriving from Google. You can do this with the header:

"Referer": "https://www.google.com/"

You can replace it with https://www.google.co.uk or https://www.google.co.in if you are trying to scrape websites based in the UK or India. This will make your request look more authentic and organic.

You can also look up the most common referrers to any site using a tool like https://www.similarweb.com. Often this will be a social media site like YouTube or Facebook.

7. Headless Browser

Websites display their content on the basis of which browser you are using, and certain content displays differently on different browsers. Let’s take the example of Google search. If the browser (identified by the user agent) has advanced capabilities, the website may present “richer” content: something more dynamic and styled, which may rely heavily on JavaScript and CSS.

The problem is that on such websites the content is rendered by JavaScript code and is not present in the raw HTML response the server delivers. In order to scrape these websites, you may need to deploy your own headless browser (or have Scrapingdog do it for you!).

Browser automation tools like Selenium or Puppeteer provide APIs to control browsers and scrape dynamic websites. I must say a lot of effort goes into making these browsers undetectable.

But this is the most effective way to scrape a dynamic website. You can even use certain browserless services that open a browser instance on their servers rather than increasing the load on yours. Moreover, you can open more than 100 instances at once on such services. So, all in all, it’s a boon for the scraping industry.
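For reference, fetching a JavaScript-rendered page with Selenium and headless Chrome might look like the sketch below. It assumes Selenium 4 and a local Chrome installation, and the --headless=new flag assumes a recent Chrome version (older ones use --headless). The function name and the window size are my own choices.

```python
# Chrome flags for running without a visible window (assumed recent Chrome).
HEADLESS_ARGS = ["--headless=new", "--window-size=1280,800"]


def fetch_rendered_html(url: str) -> str:
    """Load a page in headless Chrome and return the HTML after JavaScript has run."""
    # Imported here so this module still loads where Selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in HEADLESS_ARGS:
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

Calling fetch_rendered_html("http://example.com/") returns the rendered markup, which you can then hand to your usual HTML parser.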

8. Captcha Solving Services

Many websites use reCAPTCHA from Google, which asks you to pass a test. If you pass the test within a certain time frame, it concludes that you are not a bot but a real human being. If you are scraping a website on a large scale, the website will eventually block you.

You will start seeing captcha pages instead of web pages. There are services that get past these restrictions, such as 2Captcha.

It is a captcha-solving service that provides solutions for almost all known captcha types via a simple-to-use API. It helps bypass captchas without manual work on your side in activities such as SERP and data parsing, web scraping, and web automation.

9. Honeypot Traps

Honeypots are invisible links placed on a page to detect hacking or web scraping. More generally, a honeypot is an application that imitates the behavior of a real system in order to lure attackers.

Certain websites have installed honeypots that are invisible to a normal user but can be seen by bots or web scrapers. You need to find out whether a link has the “display: none” or “visibility: hidden” CSS property set, and if it does, avoid following that link. Otherwise the site will be able to correctly identify you as a programmatic scraper, fingerprint the properties of your requests, and block you quite easily.

Honeypots are one of the easiest ways for smart webmasters to detect crawlers, so make sure that you are performing this check on each page that you scrape.
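The check described above could be sketched as follows. Note the limitation: this version only looks at the inline style attribute of each link, so it will miss links hidden through a stylesheet or through a hidden parent element. The sample HTML and the class name are made up.

```python
import re
from html.parser import HTMLParser

# Matches "display: none" or "visibility: hidden" inside an inline style attribute.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Split the links of a page into visible ones and likely honeypots."""

    def __init__(self):
        super().__init__()
        self.visible_links = []
        self.hidden_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        style = attributes.get("style") or ""
        if HIDDEN_STYLE.search(style):
            self.hidden_links.append(href)
        else:
            self.visible_links.append(href)


# Hypothetical page with one normal link and two hidden trap links.
PAGE = (
    '<a href="/products">Products</a>'
    '<a href="/trap" style="display: none">secret</a>'
    '<a href="/trap2" style="color: red; visibility:hidden">secret</a>'
)

collector = LinkCollector()
collector.feed(PAGE)
print(collector.visible_links)
print(collector.hidden_links)
```

Only the links in visible_links should be queued for crawling.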

10. Google Cache

Google sometimes keeps a cached copy of a website. So, rather than making a request to the website itself, you can make a request to its cached copy. Simply prepend “http://webcache.googleusercontent.com/search?q=cache:” to the URL. For example, to scrape the documentation of Scrapingdog you could scrape “http://webcache.googleusercontent.com/search?q=cache:https://www.scrapingdog.com/documentation”.

But keep in mind that this technique should only be used for websites whose information is not sensitive and does not change frequently. Google also retired its public cache in 2024, so check whether this endpoint still responds before relying on it.

For example, LinkedIn tells Google not to cache its data. Google also refreshes the cached copy of a website only at certain intervals, depending on the popularity of the website.

Hopefully, you have learned new scraping tips by reading this article. I must remind you to keep respecting the robots.txt file. Also, try not to send a large volume of requests to smaller websites, because they might not have the budget that large enterprises have.

Feel free to comment and ask me anything. You can follow me on Twitter and Medium. Thanks for reading and please hit the like button!

Frequently Asked Questions

Is web scraping detectable?

Yes, web scraping can be detected. There are a few ways to detect it, but the most common is through web server logs. When a scraper hits a website, it leaves a trace in the log files. These log files can be analyzed to see what kind of traffic is coming from where, and how often. A sudden spike in traffic from a particular IP address is a good indication that someone is scraping the site.

How do you bypass bot detection?

There is no universal answer to this question, as the methods vary depending on the specific bot detection system in place. Common methods include using a proxy server or VPN, using a modified browser or user agent string, or creating a custom bot that imitates human behavior.

How do you stop a website from being crawled?

There is no surefire way to stop a website from being crawled, but there are some methods that may discourage crawlers. These include using a robots.txt file to block crawlers from certain areas of your website or using CAPTCHAs to make it more difficult for bots to access your site.

Is a VPN good for scraping?

A VPN is not a good choice for web scraping. A VPN gives you a fixed IP for a long time, so the target website might block you in no time. The host will identify that some script is operating from this IP, and your data pipeline will break.

Can you get blocked for web scraping?

Of course. You can get blocked while scraping due to numerous factors, such as too few IPs, browser fingerprinting, unusual headers, or crawling at a very fast pace. To avoid getting blocked while scraping, you are advised to use web scraping services.

How do I stop IP bans from web scraping?

To stop IP bans, follow these practices:
1. Do not violate the page rules of the website.
2. Do not crawl the website at a ridiculously heavy pace.
3. Keep rotating your headers along with your IPs.

And there’s the list! At this point, you should feel comfortable writing your first web scraper.