
Web Scraping with C#

22-04-2022

In this post, we are going to learn web scraping with C# by creating a simple web scraper using Selenium. Then, in the second part, we are going to scrape a dynamic website.


Selenium in C# (Part-I)

I am going to explain everything with a very simple step-by-step approach. So, first of all, I am going to create a new project; I have created a demo project with the name webscraping. Then we will install the Selenium library. We can do this by clicking the Tools tab and then NuGet Package Manager.

image

Then you can search for Selenium, select the solution, and install the package.

image 1

Now, we also need to download the Chrome web driver, which you can download from here. I am going to put it inside the bin folder so that I don’t have to specify a path within the code. I am going to open the folder in File Explorer:

bin > debug

And then you can just drag the chromedriver file to the debug folder. We are done with the installation. Now, we will start coding the scraper.
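As an alternative to the NuGet UI, the same library can also be installed from the Package Manager Console using the official package ID:

```powershell
Install-Package Selenium.WebDriver
```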

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a controllable Chrome browser instance
            IWebDriver driver = new ChromeDriver();
        }
    }
}

Basically, when you want a controllable browser, you just have to create an instance of this web driver; since I am using the Chrome web driver, I am going to create a ChromeDriver and import all the required packages. If I run this code, it will open a new browser window. For this post, we are going to make a Google search scraper.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    class Program
    {
        static void Main(string[] args)
        {
            IWebDriver driver = new ChromeDriver();
            // Navigate to the Google home page
            driver.Navigate().GoToUrl("https://www.google.com");
        }
    }
}

Now, here we are navigating our page to the Google home page.

image 2

Our scraper will scrape all those titles, descriptions, links, etc. In order to do that, we first have to make a search.

image 3

There are many ways by which you can locate this input field, like class name, ID, or even XPath. Here we are going to use the XPath.
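For reference, XPath is not the only locator strategy Selenium offers. Assuming the usual markup of the Google home page, where the search box carries the name attribute q (an assumption about the page, not something shown above), the same element could also be found like this:

```csharp
// Hypothetical alternatives to the long XPath used below;
// both assume the search input has name="q"
var byName = driver.FindElement(By.Name("q"));
var byCss = driver.FindElement(By.CssSelector("input[name='q']"));
```

Shorter locators like these tend to survive page redesigns better than absolute XPaths.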

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    class Program
    {
        static void Main(string[] args)
        {
            IWebDriver driver = new ChromeDriver();
            driver.Navigate().GoToUrl("https://www.google.com");
            // Locate the search input field by its XPath
            var element = driver.FindElement(By.XPath("/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input"));
        }
    }
}

Then we have to type a keyword into that input field in order to make a Google search. We are going to use the SendKeys function for that. Then we will submit the query using the Submit function.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    class Program
    {
        static void Main(string[] args)
        {
            IWebDriver driver = new ChromeDriver();
            driver.Navigate().GoToUrl("https://www.google.com");
            var element = driver.FindElement(By.XPath("/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input"));
            element.SendKeys("webshop");
            element.Submit();
        }
    }
}

So, our scraper will be able to search for the keyword webshop. Now, we will expand this scraper so that it will be able to scrape titles and other data from the Google search results.

image 4

When we visit the above page, we can do the same thing we did with the input field: find some random element and copy its XPath. The problem with doing this is that such an XPath points to one very specific element and might not apply to the other elements.

Now, if you look at the XPath of two random titles, you will find that they are in a sequence, and we can use a loop to scrape those titles.

XPath1 — //*[@id="rso"]/div[9]/div/div[1]/div[1]/a/h3

XPath2 — //*[@id="rso"]/div[10]/div/div[1]/div[1]/a/h3

So, now we can do something like this to find the elements using those XPaths.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    class Program
    {
        static void Main(string[] args)
        {
            IWebDriver driver = new ChromeDriver();
            driver.Navigate().GoToUrl("https://www.google.com");
            var element = driver.FindElement(By.XPath("/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input"));
            element.SendKeys("webshop");
            element.Submit();
            // Index on the first div removed so that every result title matches
            var titles = driver.FindElements(By.XPath("//*[@id='rso']/div/div/div[1]/div[1]/a/h3"));
        }
    }
}

Now, we can loop through these elements and each one of these will be contained within this collection of elements.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    class Program
    {
        static void Main(string[] args)
        {
            IWebDriver driver = new ChromeDriver();
            driver.Navigate().GoToUrl("https://www.google.com");
            var element = driver.FindElement(By.XPath("/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input"));
            element.SendKeys("webshop");
            element.Submit();
            var titles = driver.FindElements(By.XPath("//*[@id='rso']/div/div/div[1]/div[1]/a/h3"));
            // Print the text of every matched title element
            foreach (var title in titles)
            {
                Console.WriteLine(title.Text);
            }
        }
    }
}

When we run this, the titles get printed out.

As you can notice we have managed to scrape all the titles and it has ignored the ads. This is one way you can do it. I did it using XPath but you can also do it using class names. Let’s try it with the class name as well.

When you inspect any element you will find the class name.

Our target class name is LC20lb; you can notice that in the above image. We will use this class name to scrape the titles.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    class Program
    {
        static void Main(string[] args)
        {
            IWebDriver driver = new ChromeDriver();
            driver.Navigate().GoToUrl("https://www.google.com");
            var element = driver.FindElement(By.XPath("/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input"));
            element.SendKeys("webshop");
            element.Submit();
            // Locate the titles by their class name instead of XPath
            var titles = driver.FindElements(By.ClassName("LC20lb"));
            foreach (var title in titles)
            {
                Console.WriteLine(title.Text);
            }
        }
    }
}

When we run this code, we get the same results.

image 5

Selenium in C# (Part-II)

In this section, we are going to learn to create a very simple web scraper that can scrape content that gets loaded dynamically. We are going to scrape this page.

image 6

This purple area is actually dynamic content and that basically means that if I click on one of these menu items, the purple area will be changed to some other element but the navbar will remain the same. So, we are going to scrape these collections. So, let’s get started.

image 7

First, we will scrape the content names. Each content name is stored under the class card__contents.

image 8

There are 19 elements with class card__contents. So, the first step is to grab all those elements using the FindElements function and create a list.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    class Program
    {
        static void Main(string[] args)
        {
            IWebDriver driver = new ChromeDriver();
            driver.Navigate().GoToUrl("https://www.reverbnation.com");
            // Grab every element that carries the card__contents class
            var collections = driver.FindElements(By.ClassName("card__contents"));
        }
    }
}

Now, we will loop through that list to get the text out of those elements.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    class Program
    {
        static void Main(string[] args)
        {
            IWebDriver driver = new ChromeDriver();
            driver.Navigate().GoToUrl("https://www.reverbnation.com");
            var collections = driver.FindElements(By.ClassName("card__contents"));
            foreach (var collection in collections)
                Console.WriteLine(collection.Text);
        }
    }
}

So, what we are doing is: we open the home page, and then we click on the Discover button at the top of the page before finding any card contents.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    class Program
    {
        static void Main(string[] args)
        {
            IWebDriver driver = new ChromeDriver();
            driver.Navigate().GoToUrl("https://www.reverbnation.com");
            // Click the Discover menu item before looking for card contents
            var discoverButton = driver.FindElement(By.Id("menu-item-discover"));
            discoverButton.Click();
            var collections = driver.FindElements(By.ClassName("card__contents"));
            foreach (var collection in collections)
                Console.WriteLine(collection.Text);
        }
    }
}

Now, we actually have a web scraper right here, but since the content is dynamic, we will not really get any elements or collection names. So, I am going to just run it and show you.

image 9

You can see nothing gets printed out.

Now, we will make a new method. Obviously, if you were making a real web scraper, you would probably want a separate class, and then instead of a static method like I use here, you would have a method on that web scraper class.
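As a rough sketch of that idea (the class and method names here are illustrative, not from the original code), the scraper could be wrapped in a class that owns the driver:

```csharp
using System;
using System.Collections.Generic;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    // Hypothetical wrapper: the driver lives in an instance instead of static state
    class WebScraper : IDisposable
    {
        private readonly IWebDriver driver = new ChromeDriver();

        public void GoTo(string url) => driver.Navigate().GoToUrl(url);

        public IReadOnlyCollection<IWebElement> FindElements(By by)
        {
            return driver.FindElements(by);
        }

        public void Dispose() => driver.Quit();
    }
}
```

For this tutorial, though, we will stick with static methods to keep the code short.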

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    class Program
    {
        static IWebDriver driver;

        static void Main(string[] args)
        {
            driver = new ChromeDriver();
            driver.Navigate().GoToUrl("https://www.reverbnation.com");
            var discoverButton = driver.FindElement(By.Id("menu-item-discover"));
            discoverButton.Click();
            var collections = driver.FindElements(By.ClassName("card__contents"));
            foreach (var collection in collections)
                Console.WriteLine(collection.Text);
        }

        static IReadOnlyCollection<IWebElement> FindElements(By by)
        {
            return driver.FindElements(by);
        }
    }
}

Now, if I run this function, it will work the same as the above code (the collections variable). What I want to do is run a loop: on each iteration it will try to find the elements using this By locator, and if the collection contains elements, it will return them; otherwise, it will try again and again.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    class Program
    {
        static IWebDriver driver;

        static void Main(string[] args)
        {
            driver = new ChromeDriver();
            driver.Navigate().GoToUrl("https://www.reverbnation.com");
            var discoverButton = driver.FindElement(By.Id("menu-item-discover"));
            discoverButton.Click();
            var collections = FindElements(By.ClassName("card__contents"));
            foreach (var collection in collections)
                Console.WriteLine(collection.Text);
        }

        static IReadOnlyCollection<IWebElement> FindElements(By by)
        {
            // Keep polling until at least one matching element appears
            while (true)
            {
                var elements = driver.FindElements(by);
                if (elements.Count > 0)
                    return elements;
                Thread.Sleep(10); // pause 10 ms between attempts
            }
        }
    }
}

You can see we only return when the collection contains elements; otherwise, nothing is returned yet. Between attempts, we also put the thread to sleep for 10 milliseconds.
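As an aside, Selenium's .NET bindings also ship a helper for exactly this polling pattern: WebDriverWait, from the Selenium.Support package. A minimal sketch, assuming the same driver and class name as above:

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

// Poll for up to 10 seconds until at least one matching element exists;
// returning null from the lambda tells the wait to keep polling
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
var titles = wait.Until(d =>
{
    var found = d.FindElements(By.ClassName("card__contents"));
    return found.Count > 0 ? found : null;
});
```

Unlike our hand-rolled loop, WebDriverWait throws a WebDriverTimeoutException when the timeout is reached.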

Now, we can test our script once again.

image 10

So, it works; it also prints out some other stuff, but that doesn't really matter. Now, the problem is that we have a while(true) loop, and while that might be okay in some cases, in others it is not. Say, for example, you use a proxy; it runs, and all of a sudden it stops working, or maybe the document simply doesn't contain these elements. The loop would then spin forever. So, we have to put a limit on this to avoid such problems, and I am just going to use a stopwatch.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    class Program
    {
        static IWebDriver driver;

        static void Main(string[] args)
        {
            driver = new ChromeDriver();
            driver.Navigate().GoToUrl("https://www.reverbnation.com");
            var discoverButton = driver.FindElement(By.Id("menu-item-discover"));
            discoverButton.Click();
            var collections = FindElements(By.ClassName("card__contents"));
            foreach (var collection in collections)
                Console.WriteLine(collection.Text);
        }

        static IReadOnlyCollection<IWebElement> FindElements(By by)
        {
            Stopwatch w = Stopwatch.StartNew();
            while (w.ElapsedMilliseconds < 10 * 1000)
            {
                var elements = driver.FindElements(by);
                if (elements.Count > 0)
                    return elements;
                Thread.Sleep(10);
            }
            return null; // give up after 10 seconds
        }
    }
}

So, if the elapsed time is greater than 10 seconds our scraper will return null. Let’s test it.

image 11

Now, what if we have a very slow internet connection? For that, we will use ChromeNetworkConditions.

Slow Internet Connection

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    class Program
    {
        static IWebDriver driver;

        static void Main(string[] args)
        {
            // Describe a slow connection (not applied yet; we attach it to the driver in the next step)
            ChromeNetworkConditions conditions = new ChromeNetworkConditions();
            conditions.DownloadThroughput = 25 * 1000;
            conditions.UploadThroughput = 10 * 1000;
            conditions.Latency = TimeSpan.FromMilliseconds(1);
            driver = new ChromeDriver();
            driver.Navigate().GoToUrl("https://www.reverbnation.com");
            var discoverButton = driver.FindElement(By.Id("menu-item-discover"));
            discoverButton.Click();
            var collections = FindElements(By.ClassName("card__contents"));
            foreach (var collection in collections)
                Console.WriteLine(collection.Text);
        }

        static IReadOnlyCollection<IWebElement> FindElements(By by)
        {
            Stopwatch w = Stopwatch.StartNew();
            while (w.ElapsedMilliseconds < 10 * 1000)
            {
                var elements = driver.FindElements(by);
                if (elements.Count > 0)
                    return elements;
                Thread.Sleep(10);
            }
            return null;
        }
    }
}

Since we have stored the Chrome driver in an IWebDriver variable, I cast it back to ChromeDriver in order to apply the network conditions.

using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace webscraping
{
    class Program
    {
        static IWebDriver driver;

        static void Main(string[] args)
        {
            ChromeNetworkConditions conditions = new ChromeNetworkConditions();
            conditions.DownloadThroughput = 25 * 1000;
            conditions.UploadThroughput = 10 * 1000;
            conditions.Latency = TimeSpan.FromMilliseconds(1);
            driver = new ChromeDriver();
            // Apply the simulated slow connection to the Chrome driver
            (driver as ChromeDriver).NetworkConditions = conditions;
            driver.Navigate().GoToUrl("https://www.reverbnation.com");
            var discoverButton = driver.FindElement(By.Id("menu-item-discover"));
            discoverButton.Click();
            var collections = FindElements(By.ClassName("card__contents"));
            foreach (var collection in collections)
                Console.WriteLine(collection.Text);
            Console.WriteLine("done");
        }

        static IReadOnlyCollection<IWebElement> FindElements(By by)
        {
            Stopwatch w = Stopwatch.StartNew();
            while (w.ElapsedMilliseconds < 10 * 1000)
            {
                var elements = driver.FindElements(by);
                if (elements.Count > 0)
                    return elements;
                Thread.Sleep(10);
            }
            // Return an empty collection instead of null on timeout
            return new ReadOnlyCollection<IWebElement>(new List<IWebElement>());
        }
    }
}

So, now it should actually simulate a slow connection. This is an important part of creating a web scraper: handling slow connections, since you often want to have many proxies running and some of them could be slow.
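On the proxy point: one common way to route the browser through a proxy is a Chrome command-line switch set via ChromeOptions. A minimal sketch; the address below is a placeholder, not a real endpoint:

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

var options = new ChromeOptions();
// Placeholder proxy address; substitute your own proxy here
options.AddArgument("--proxy-server=http://127.0.0.1:8080");
IWebDriver driver = new ChromeDriver(options);
```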

Conclusion

In this tutorial, we created a very simple web scraper using Selenium and C#. First, we created a scraper that makes a Google search and then scrapes the results. In the second section, we focused on dynamic web scraping and slow connections. Now, you are trained to create commercial scrapers for your new business ventures.

If you don’t want to code your own scraper then you can always use our web scraping API.

Feel free to message us to inquire about anything you need clarification on.


Manthan Koolwal

My name is Manthan Koolwal and I am the CEO of scrapingdog.com. I love creating scrapers and seamless data pipelines.