
4 Best Methods To Find All The URLs on A Domain’s Website

07-06-2024

Finding all the URLs on a domain can be a critical task for various reasons, such as SEO analysis, content auditing, or simply understanding the structure of a website. Whether you’re an SEO specialist, a web developer, or just someone interested in knowing what topics a website has covered, a list of URLs on a domain is invaluable.

Find All the URLs on A Domain’s Website

In this article, we’ll look at 4 different methods to find all pages on a domain. By the end, you’ll understand how web scraping and crawling can help in this process, and you’ll have practical techniques for finding all the URLs on a domain efficiently.

How To Find All URLs On A Domain’s Website

Let’s explore some effective ways to find all URLs on a website. We’ll cover methods like Google search, sitemaps & robots.txt, SEO crawling tools, and building your own script with Python. Most of these methods are easy to use, and anyone can leverage them to find the URLs on a website.

Google search technique

A simple and fast method to find URLs is Google Search. You can enter a specific search query to locate website pages. However, not all pages may be included in the search results. This is because Google may exclude certain pages from its index for reasons like duplicate content, low quality, or inaccessibility.

Sitemaps and robots.txt

Examining the website’s sitemap and robots.txt file can provide valuable insights for those comfortable with delving into technical details. These files contain important information about the website’s structure and content. Although this approach offers more accuracy, it is also more challenging.

SEO crawling tools

If you want a straightforward and hassle-free solution, you should try using an SEO spider tool. Many of these tools are easy to use and provide detailed insights, but there’s a catch: using them extensively often comes with a cost. However, if you have a smaller website with fewer than 500 pages, you can find free SEO spider tools that offer plenty of functionality.

Custom scripting

For those comfortable with coding, creating a custom script offers the most control and potentially comprehensive results. For example, you could target only URLs containing a certain keyword or exclude specific directories. However, it requires technical expertise and time investment. If you’re up for the challenge, creating a script can be rewarding.

Now, let’s explore each of these methods in detail and see how you can use them to find all URLs on a website.

Method 1: Google Site Search

One of the quickest ways to find URLs on a website is with Google’s site search feature. Here’s how to use it:

  • Go to Google.com
  • In the search bar, type site:example.com (replace example.com with the website you want to search).
  • Hit enter to see the list of indexed pages for that website.
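If you run this kind of query often, the steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of any official tooling: the helper name site_search_url is hypothetical, and it only builds the search URL so you can open it in a browser.

```python
from urllib.parse import urlencode


def site_search_url(domain: str) -> str:
    """Build a Google search URL that uses the `site:` operator (hypothetical helper)."""
    # urlencode percent-encodes the colon in "site:example.com"
    return "https://www.google.com/search?" + urlencode({"q": f"site:{domain}"})


print(site_search_url("scrapingdog.com"))
```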

Let’s find all the pages on scrapingdog.com. Search for site:scrapingdog.com on Google and hit enter.

Google will return a list of indexed pages for a specific domain.

However, the “site:” operator only returns pages that are in Google’s index. Google may leave pages out because of duplicate content, low quality, or inaccessibility, so this query is best for a rough estimate rather than a complete list.

Using Scrapingdog to organize search results

While Google search is easy, what if you need the page URLs and titles? Manually copying them is inefficient.

A better approach is to use the Scrapingdog Google Search Result Scraper API. It organizes search results in a user-friendly JSON format. Scrapingdog offers a free trial, so you can register and try it out.

After registering, you’ll see your dashboard with your API key and remaining credits. Ignore the key for now. On the left, find “Google Scraper” and click it.

Scrapingdog Dashboard

In the Search field, enter your search query, specify the desired number of results to scrape (e.g., 10), and then click the Scrape button.

Google Search API Output

You’ll receive your output in a clean JSON format as shown below. You can simply copy this JSON document and use it for your needs.

{
    "description": "Welcome Back. Sign In. I am a New User · Forgot Password.",
    "synopsis": null,
    "title": "Login",
    "link": "https://api.scrapingdog.com/login",
    "position": 1
},
{
    "description": "Start Scraping in Seconds. Register. By creating account you agree to our Terms & Privacy Policy. I am an Existing User.",
    "synopsis": null,
    "title": "Register",
    "link": "https://api.scrapingdog.com/register",
    "position": 2
},
{
    "description": "Scrapingdog is the best SerpAPI alternative. Scrape Google Search results at 78% less cost.",
    "synopsis": null,
    "title": "An Economical SerpAPI Alternative for Google Search API",
    "link": "https://www.scrapingdog.com/serpapi-alternative",
    "position": 3
}
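
Once you have results like these, pulling out just the titles and URLs takes only a few lines. The sketch below assumes the results have been collected into a JSON array of objects shaped like the excerpt above; the sample data is trimmed for illustration and the helper name extract_links is hypothetical.

```python
import json

# Trimmed sample, assumed to be a JSON array of result objects
sample_json = """
[
    {"title": "Login", "link": "https://api.scrapingdog.com/login", "position": 1},
    {"title": "Register", "link": "https://api.scrapingdog.com/register", "position": 2}
]
"""


def extract_links(results):
    """Return (title, link) pairs, skipping entries that have no link."""
    return [(item.get("title", ""), item["link"]) for item in results if item.get("link")]


pairs = extract_links(json.loads(sample_json))
for title, link in pairs:
    print(f"{title}: {link}")
```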

Method 2: Sitemaps and robots.txt

Another way to find website URLs is by checking its sitemap and robots.txt files. While this approach is more technical, it can yield more detailed results, because these files describe the website’s structure and content directly. Let’s explore how sitemaps and robots.txt files can help us uncover all of a website’s URLs.

Using sitemaps

A sitemap is an XML file listing all important website pages for search engine indexing. Webmasters use it to help search engines understand the website’s structure and content for better indexing.

Most well-maintained websites publish a sitemap, since it helps search engines discover pages and is considered a good SEO practice.

Here’s what a standard sitemap looks like:

XML Sitemap

The <loc> element specifies the page URL, <lastmod> indicates the last modification time, and <priority> is a hint about the page’s importance relative to other pages on the same site (search engines may use it when deciding what to crawl, but they are free to ignore it).
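
To make these elements concrete, here is a small sketch that reads them with Python’s built-in xml.etree.ElementTree module. The sitemap text is a made-up sample for example.com, and read_sitemap is a hypothetical helper; only <loc> is required by the sitemap protocol, so the other fields may come back as None.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sample sitemap for illustration
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-06-01</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/blog/</loc>
  </url>
</urlset>"""


def read_sitemap(xml_text):
    """Return one dict per <url> entry with loc, lastmod, and priority."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        entries.append({
            "loc": url.findtext("sm:loc", default=None, namespaces=SITEMAP_NS),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS),
            "priority": url.findtext("sm:priority", default=None, namespaces=SITEMAP_NS),
        })
    return entries


entries = read_sitemap(sample_sitemap)
for entry in entries:
    print(entry)
```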

Now, where to find a sitemap? Check for /sitemap.xml on the website (e.g., https://www.scrapingdog.com/sitemap.xml). Websites can have multiple sitemaps in various locations, including: /sitemap.xml.gz, /sitemap_index.xml, /sitemap_index.xml.gz, /sitemap.php, /sitemapindex.xml, /sitemap.gz.

The robots.txt file, which we’ll discuss next, often lists all the sitemap locations.
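
If you want to check those common locations programmatically, a short sketch like the following generates the candidate URLs for a given site. The helper name candidate_sitemap_urls is hypothetical, and the path list simply mirrors the locations mentioned above; you would still need to request each URL to see which ones exist.

```python
from urllib.parse import urljoin

# Common sitemap locations listed in this article
COMMON_SITEMAP_PATHS = [
    "/sitemap.xml",
    "/sitemap.xml.gz",
    "/sitemap_index.xml",
    "/sitemap_index.xml.gz",
    "/sitemap.php",
    "/sitemapindex.xml",
    "/sitemap.gz",
]


def candidate_sitemap_urls(base_url):
    """Join each common sitemap path onto the site's base URL."""
    return [urljoin(base_url, path) for path in COMMON_SITEMAP_PATHS]


candidates = candidate_sitemap_urls("https://www.scrapingdog.com")
for candidate in candidates:
    print(candidate)
```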

Using robots.txt

The robots.txt file tells search engine crawlers which paths they may crawl and which ones to stay out of. It can also specify the location of the website’s sitemap. The file lives at the root of the site, at the /robots.txt path (e.g., https://scrapingdog.com/robots.txt).

Here’s an example of a robots.txt file. Some routes are disallowed for crawling, and the sitemap location is also present.

robots.txt file
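
Python’s standard library can read this file for you. The sketch below feeds a made-up robots.txt (for example.com, not the real Scrapingdog file) into urllib.robotparser and lists the sitemap locations it declares. Note that site_maps() requires Python 3.8 or newer; for a live site you would call set_url() and read() instead of parse().

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration
sample_robots = """\
User-agent: *
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/blog/sitemap_index.xml
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

# site_maps() returns None when no Sitemap lines are present
sitemap_urls = parser.site_maps() or []
print(sitemap_urls)

# Check whether a generic crawler is allowed to fetch a given path
print(parser.can_fetch("*", "https://www.example.com/admin/users"))
print(parser.can_fetch("*", "https://www.example.com/blog/"))
```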

You need to visit both sitemaps and find all the URLs within the website. Note that, for smaller sitemaps, you can manually copy the URLs from each <loc> tag. But for larger sitemaps, consider using an online tool to convert the XML format to a more manageable format, such as CSV. There are free tools available, like SEOWL XML Sitemap URL Extractor.

SEOWL XML Sitemap Extractor Tool Dashboard

Method 3: SEO Crawling Tools

Now let’s see how SEO crawling tools help us find all website pages. There are various SEO crawlers on the market; here we’ll explore the free tool XML-Sitemaps.com. Enter your URL and click “START” to create a sitemap. This tool is suitable when you need to quickly create a sitemap for a small website (up to 500 pages).

XML-Sitemaps.com

The process will start, and you will see the number of pages scanned (167 in this case) and the number of pages indexed (127 in this case). Here, “indexed” most likely refers to the pages the tool has added to the generated sitemap rather than to Google’s index, so treat it as the tool’s own count.

Once the crawling process is complete, the sitemap preview will display all the website’s indexed URLs, including the last modification date and time, as well as the priority of each URL.

You can then download the XML sitemap file, or receive it via email, and upload it to your website.

Method 4: Building Your Own Crawler Script

If you’re a developer, you can build your own crawler script to find all URLs on a website. This method offers more flexibility and control over crawling compared to previous methods. It allows you to customize behavior, handle dynamic content, and extract URLs based on specific patterns or criteria.
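
As a small illustration of that control, the sketch below keeps only URLs that contain a keyword and drops anything under excluded directories. The function name filter_urls and its parameters are hypothetical; a real script would apply the same idea to URLs collected from a sitemap or crawl.

```python
from urllib.parse import urlparse


def filter_urls(urls, keyword=None, excluded_dirs=()):
    """Keep URLs containing `keyword` and drop those under any excluded directory."""
    kept = []
    for url in urls:
        path = urlparse(url).path
        if keyword and keyword not in url:
            continue  # keyword requested but not present
        if any(path.startswith(prefix) for prefix in excluded_dirs):
            continue  # URL lives under an excluded directory
        kept.append(url)
    return kept


urls = [
    "https://www.example.com/blog/web-scraping-python",
    "https://www.example.com/blog/pricing-update",
    "https://www.example.com/admin/web-scraping-settings",
]
print(filter_urls(urls, keyword="web-scraping", excluded_dirs=("/admin/",)))
```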

You can use any language for web crawling, such as Python, JavaScript, or Golang. In this example, we will focus on Python. Its simplicity and rich ecosystem make it a popular choice for web crawling tasks.

Python offers a vast range of libraries and frameworks specifically designed for web crawling and data extraction, including popular ones like Requests, BeautifulSoup, and Scrapy.

Using Python libraries

First, we’ll create a Python script using BeautifulSoup and Requests to extract URLs from a website sitemap.

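Install the libraries using pip. The script parses XML with BeautifulSoup’s "xml" parser, which relies on lxml, so install that as well:

pip install beautifulsoup4 requests lxml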

Next, let’s create a Python file (in my case, main.py) and import the necessary dependencies:

import requests
from bs4 import BeautifulSoup as bs
import csv

Next, define the extract_sitemap_data function. It takes the URL of a sitemap as input and returns a list of the URL entries found in it. The function uses BeautifulSoup to parse the sitemap’s XML content, recursively extracts data from any nested sitemaps, and parses the URL entries from the current sitemap.

def extract_sitemap_data(url):
    if not url:
        raise ValueError("Please provide a valid sitemap URL.")
    try:
        # Use a timeout so a slow server cannot hang the script
        response = requests.get(url, timeout=10)

        # Raise an exception for unsuccessful requests (4xx or 5xx status codes)
        response.raise_for_status()
    except (ValueError, requests.exceptions.RequestException) as e:
        print(f"Error fetching sitemap data from {url}: {e}")
        return []
    soup = bs(response.content, "xml")

    # Extract data from nested sitemaps
    url_entries = []
    for sitemap in soup.find_all("sitemap"):
        nested_url = sitemap.find("loc").text
        url_entries.extend(extract_sitemap_data(nested_url))

    # Extract data from current sitemap URLs
    url_entries.extend(_parse_url_entries(soup))

    return url_entries

Next, the _parse_url_entries function is defined. It takes a BeautifulSoup object representing the XML content of a sitemap and extracts URL entries along with their attributes such as location (loc), last modified time (lastmod), and priority (priority). It returns a list of URL entries.

def _parse_url_entries(soup):
    url_entries = []
    attributes = ["loc", "lastmod", "priority"]
    for url in soup.find_all("url"):
        entry = []
        for attr in attributes:
            element = url.find(attr)
            entry.append(element.text if element else "n/a")
        url_entries.append(entry)
    return url_entries

Next, the process_sitemaps function takes a list of sitemap URLs as input, extracts data from each sitemap using the extract_sitemap_data function, and aggregates all the URL entries. Finally, it calls save_to_csv (defined in the complete code below) to write them to a CSV file.

def process_sitemaps(sitemaps):
    all_entries = []
    for sitemap_url in sitemaps:
        entries = extract_sitemap_data(sitemap_url)
        all_entries.extend(entries)

    # Save the extracted data to a CSV file
    save_to_csv(all_entries)

Here’s the complete code:

import requests
from bs4 import BeautifulSoup as bs
import csv

def extract_sitemap_data(url):
    if not url:
        raise ValueError("Please provide a valid sitemap URL.")
    try:
        # Use a timeout so a slow server cannot hang the script
        response = requests.get(url, timeout=10)

        # Raise an exception for unsuccessful requests (4xx or 5xx status codes)
        response.raise_for_status()
    except (ValueError, requests.exceptions.RequestException) as e:
        print(f"Error fetching sitemap data from {url}: {e}")
        return []
    soup = bs(response.content, "xml")

    # Extract data from nested sitemaps
    url_entries = []
    for sitemap in soup.find_all("sitemap"):
        nested_url = sitemap.find("loc").text
        url_entries.extend(extract_sitemap_data(nested_url))

    # Extract data from current sitemap URLs
    url_entries.extend(_parse_url_entries(soup))

    return url_entries

def _parse_url_entries(soup):
    url_entries = []
    attributes = ["loc", "lastmod", "priority"]
    for url in soup.find_all("url"):
        entry = []
        for attr in attributes:
            element = url.find(attr)
            entry.append(element.text if element else "n/a")
        url_entries.append(entry)
    return url_entries

def process_sitemaps(sitemaps):
    all_entries = []
    for sitemap_url in sitemaps:
        entries = extract_sitemap_data(sitemap_url)
        all_entries.extend(entries)

    # Save the extracted data to a CSV file
    save_to_csv(all_entries)

def save_to_csv(data, filename="data.csv"):
    # Open in write mode so re-running the script does not append duplicate rows
    with open(filename, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(data)

sitemaps = [
    "https://www.scrapingdog.com/sitemap.xml",
    "https://www.scrapingdog.com/blog/sitemap_index.xml"
]

process_sitemaps(sitemaps)

The crawled URLs will be stored in a CSV file, as shown below:

CSV File With All URLs

What if you don’t know where a website’s sitemap lives? Sitemaps are common, but they are not always at /sitemap.xml. A third-party Python library, ultimate-sitemap-parser, can help: it looks for sitemaps referenced in robots.txt and at commonly used paths. To install it, run pip install ultimate-sitemap-parser.

Here’s the code:

import csv
from usp.tree import sitemap_tree_for_homepage

# Retrieve all pages from the sitemap of the given domain
def fetch_pages_from_sitemap(domain):
    raw_pages = []
    sitemap_tree = sitemap_tree_for_homepage(domain)
    for page in sitemap_tree.all_pages():
        raw_pages.append(page.url)
    return raw_pages

# Filter out duplicate pages and return a list of unique page links
def filter_unique_pages(raw_pages):
    unique_pages = []
    for page in raw_pages:
        if page not in unique_pages:
            unique_pages.append(page)
    return unique_pages

# Save the list of unique pages to a CSV file
def save_pages_to_csv(pages, filename):
    with open(filename, mode="w", newline="") as file:
        writer = csv.writer(file)
        for page in pages:
            writer.writerow([page])

if __name__ == "__main__":
    domain = "https://scrapingdog.com/"
    raw_pages = fetch_pages_from_sitemap(domain)
    unique_pages = filter_unique_pages(raw_pages)
    save_pages_to_csv(unique_pages, "data.csv")

The first function takes a domain URL as an argument. It uses the sitemap_tree_for_homepage function to locate the sitemaps associated with that domain and walks them to collect every page they list. The result is a list of all the page URLs the library could find.

The second function takes that list and removes duplicates, leaving one entry for each unique URL found in the sitemaps.
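
For large sites, checking each page against a growing list gets slow. One alternative, shown here as a sketch rather than a change to the script above, is to rely on dictionary keys, which are unique and keep insertion order. The function name unique_in_order is hypothetical.

```python
def unique_in_order(pages):
    """Remove duplicates while preserving first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(pages))


raw_pages = [
    "https://www.example.com/",
    "https://www.example.com/blog/",
    "https://www.example.com/",
]
print(unique_in_order(raw_pages))
```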

The result is a data.csv file with one URL per row.

This code serves as a great starting point. However, if you’re looking for a more robust solution, you can refer to our detailed guide on Building a Web Crawler in Python.
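
For sites with no usable sitemap, the core step of a crawler is collecting the internal links on each page. The sketch below shows that single step using only the standard library; it works on an HTML string you have already downloaded, and the names LinkCollector and internal_links are hypothetical. A full crawler would repeat this for every newly discovered URL.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect unique same-domain links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.domain = urlparse(base_url).netloc
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links and drop any #fragment
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if urlparse(absolute).netloc == self.domain and absolute not in self.links:
            self.links.append(absolute)


def internal_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample_html = """
<a href="/blog/">Blog</a>
<a href="pricing#plans">Pricing</a>
<a href="https://other.example.org/">External</a>
<a href="mailto:hi@example.com">Mail</a>
<a href="/blog/">Blog again</a>
"""
found = internal_links(sample_html, "https://www.example.com/")
print(found)
```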

Using Scrapingdog to send requests

Now that you know how to scrape Google search results, let’s explore a solution that can help you scrape millions of Google pages without being blocked.

We’ll use Scrapingdog Google Search Result Scraper API for this task. This API handles everything from proxy rotation to headers. You just have to send a GET request and in return, you will get parsed JSON data.

Google Search Result Scraper API

This API offers a free trial that you can register for. After registering for a free account, read the docs to get a complete picture of the API.

import requests
import json

api_key = "YOUR_API_KEY"
url = "https://api.scrapingdog.com/google/"
params = {
    "api_key": api_key,
    "query": "site:scrapingdog.com",
    "results": 150,
    "country": "in",
    "page": 0
}

response = requests.get(url, params=params)

if response.ok:
    data = response.json()
    # Save the data to a JSON file
    with open('scrapingdog_data.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print("Data saved to scrapingdog_data.json")
else:
    print(f"Request failed with status code: {response.status_code}")

The code is simple. We are sending a GET request to https://api.scrapingdog.com/google/ along with some parameters. For more information on these parameters, you can again refer to the documentation.

Once you run this code you will get a beautiful JSON response.

JSON Response

What if you need results from a different country? Google shows different results based on location, but you can adjust. Just change the ‘country’ parameter in the code above. For example, to get results from the UK, use ‘gb’ (its ISO code). You can also extract the desired number of results by changing the ‘results’ parameter.
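
If you plan to vary these settings, it can help to build the parameters in one place. The sketch below reuses the parameter names from the request above; the helper name build_search_params is hypothetical, and treating ‘page’ as a zero-based page counter is an assumption you should confirm in the documentation. It only prepares the parameters and does not send any requests.

```python
def build_search_params(api_key, domain, country="in", results=10, pages=1):
    """Return one parameter dict per results page (assumes zero-based 'page')."""
    return [
        {
            "api_key": api_key,
            "query": f"site:{domain}",
            "results": results,
            "country": country,
            "page": page,
        }
        for page in range(pages)
    ]


# Example: two pages of UK results
param_sets = build_search_params("YOUR_API_KEY", "scrapingdog.com", country="gb", pages=2)
for params in param_sets:
    print(params)
```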

What You Can Do With These URLs

You now have a CSV file containing URLs. What can you do with them? Depending on your objective, you can extract the information you need from each webpage, such as text, images, videos, metadata, or any other relevant content.

Once the data is scraped, you can organize and analyze it. This process could involve storing it in a database, generating reports, or visualizing it to identify trends or patterns.
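
Even before scraping the pages themselves, the URL list can tell you something about a site’s structure. As a rough sketch, the snippet below counts URLs by their first path segment; the function name count_by_section and the sample URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_section(urls):
    """Count URLs by their first path segment, e.g. 'blog' for /blog/post."""
    counts = Counter()
    for url in urls:
        segments = [part for part in urlparse(url).path.split("/") if part]
        counts[segments[0] if segments else "(root)"] += 1
    return counts


sample_urls = [
    "https://www.example.com/",
    "https://www.example.com/blog/first-post",
    "https://www.example.com/blog/second-post",
    "https://www.example.com/pricing",
]
section_counts = count_by_section(sample_urls)
for section, total in section_counts.most_common():
    print(section, total)
```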

There are plenty of resources available to assist you in the data extraction process. Below are some informative articles that provide a wealth of information.

  1. Web Scraping for Market Research in 2024
  2. Build a Web Crawler in Python
  3. Web Crawling using Javascript & Nodejs
  4. Web Scraping with Javascript and NodeJS
  5. Web Scraping in Python
  6. How To Avoid Getting Blocked While Scraping the Web

Finally, take a look at the Scrapingdog API. This user-friendly web scraping API manages rotating proxies, headless browsers, and CAPTCHAs for you, so you can focus on the data and insights you truly need.

Manthan Koolwal

My name is Manthan Koolwal and I am the founder of scrapingdog.com. I love creating scrapers and seamless data pipelines.