Web Crawler

Updated on 10 July, 2025

Problem Statement

In this task, you're required to implement a web crawler using a given startUrl and an interface HtmlParser. The purpose of the crawler is to fetch all URLs that share the same hostname as the startUrl. This entails:

  1. Initiating the crawl from the startUrl.
  2. Utilizing the HtmlParser.getUrls(url) method to retrieve a list of URLs from any given webpage represented by the URL.
  3. Ensuring each link is crawled only once to avoid redundancy.
  4. Restricting the crawl to links that share the same hostname as the startUrl.

The "hostname" here is the usual one seen in URLs: for example, http://news.example.com/articles and http://news.example.com/photos share the hostname news.example.com. Only URLs that use the same protocol (assumed to be HTTP for all links) and the same hostname as the startUrl are considered for crawling.

The HtmlParser interface provides the crucial method getUrls, which when given a URL returns all linked URLs from that specific webpage. Note that even URLs distinguished only by a trailing slash (e.g., http://site.org vs. http://site.org/) are treated as different entities.
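
In code, the contract is just that one method. Below is a minimal Python sketch of the interface, for orientation only; the judge supplies the real implementation, so the body here is a placeholder.

python
from typing import List

class HtmlParser:
    """Illustrative stand-in for the parser provided by the problem environment."""

    def getUrls(self, url: str) -> List[str]:
        # Returns all URLs linked from the page at `url`.
        raise NotImplementedError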

Examples

Example 1

Input:

urls = [
  "http://news.example.com",
  "http://news.example.com/news",
  "http://news.example.com/news/topics/",
  "http://news.google.com",
  "http://news.example.com/us"
]
edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]
startUrl = "http://news.example.com/news/topics/"

Output:

[
  "http://news.example.com",
  "http://news.example.com/news",
  "http://news.example.com/news/topics/",
  "http://news.example.com/us"
]

Example 2

Input:

urls = [
  "http://news.example.com",
  "http://news.example.com/news",
  "http://news.example.com/news/topics/",
  "http://news.google.com"
]
edges = [[0,2],[2,1],[3,2],[3,1],[3,0]]
startUrl = "http://news.google.com"

Output:

["http://news.google.com"]

Explanation:

Every page linked from the startUrl is on a different hostname, so none of them are crawled and only the startUrl itself is returned.

Constraints

  • 1 <= urls.length <= 1000
  • 1 <= urls[i].length <= 300
  • startUrl is one of the urls.
  • The hostname is from 1 to 63 characters long (including the dots) and may contain only the ASCII letters 'a' to 'z', the digits '0' to '9', and the hyphen-minus character ('-'); a validation sketch of these rules follows the list.
  • The hostname may not start or end with the hyphen-minus character ('-').
  • You may assume there are no duplicate URLs in the urls list.
  • See: https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_hostnames
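
The crawler never needs to validate hostnames itself, but as an illustration of the rules above, a check along these lines would enforce them. is_valid_hostname is a hypothetical helper, not part of the required solution.

python
import re

# Allowed characters per the constraints: lowercase letters, digits, '-', plus '.' separators.
_ALLOWED = re.compile(r"[a-z0-9.-]+")

def is_valid_hostname(hostname: str) -> bool:
    return (
        1 <= len(hostname) <= 63                      # 1-63 characters, including the dots
        and _ALLOWED.fullmatch(hostname) is not None  # only the allowed characters
        and not hostname.startswith("-")              # no leading hyphen-minus
        and not hostname.endswith("-")                # no trailing hyphen-minus
    )

print(is_valid_hostname("news.example.com"))  # True
print(is_valid_hostname("-example.com"))      # False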

Approach and Intuition

The problem can be approached by understanding the structure and connectivity of web pages through their URLs. Essentially, this is a graph traversal problem where each URL represents a node, and links between them are directed edges:

  1. Host Extraction: Extract the hostname from the startUrl to use as a comparison reference for all other URLs (a short sketch of this step follows the list).

  2. Traversal Strategy: Use either Breadth-First Search (BFS) or Depth-First Search (DFS) to crawl connected URLs:

    • Skip any URL whose hostname does not match the startUrl’s hostname.
    • Skip any URL already visited to avoid cycles or redundant requests.
  3. Visited Set: Use a Set to store visited URLs to ensure no duplicate visits.

  4. Fetching Links: Use the HtmlParser.getUrls(url) function to fetch child links from the current URL.
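
Concretely, step 1 amounts to slicing out whatever sits between the "http://" prefix and the next '/'. A minimal Python sketch, assuming (as the problem guarantees) that every URL starts with http://:

python
def extract_host(url: str) -> str:
    # Drop the 7-character "http://" prefix, then keep everything before the next '/'.
    return url[7:].split('/', 1)[0]

print(extract_host("http://news.example.com/news/topics/"))  # news.example.com
print(extract_host("http://news.example.com"))               # news.example.com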

Example Walkthrough

Starting from:

"http://news.example.com/news/topics/"

We retrieve:

→ "http://news.example.com"
→ "http://news.example.com/news"

Further traversal leads to:

→ "http://news.example.com/us"

At no point do we include "http://news.google.com", because its hostname does not match news.example.com.
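
To make the walkthrough easier to follow, the edges array of Example 1 can be unpacked into an explicit link map; this is only a restatement of the example data, with each index replaced by its URL.

python
# Derived from edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]: urls[a] links to urls[b].
links = {
    "http://news.example.com/news/topics/": ["http://news.example.com",
                                             "http://news.example.com/news"],
    "http://news.example.com":              ["http://news.example.com/us"],
    "http://news.example.com/news":         [],
    "http://news.google.com":               ["http://news.example.com/news/topics/",
                                             "http://news.example.com/news"],
    "http://news.example.com/us":           [],
}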

Solutions

  • C++
cpp
class Solution {
public:
    vector<string> extractUrls(string initialUrl, HtmlParser parser) {
        // Hostname = everything between the 7-character "http://" prefix and the next '/'.
        // If no '/' follows the prefix, find() returns npos and min() clamps to the end of the string.
        function<string(string)> extractHost = [](string url) -> string {
            size_t cutPoint = min(url.size(), url.find('/', 7));
            return url.substr(7, cutPoint - 7);
        };
    
        queue<string> urlsToVisit;
        urlsToVisit.push(initialUrl);
        unordered_set<string> visitedUrls{initialUrl};
        string rootHost = extractHost(initialUrl);
        while (!urlsToVisit.empty()) {
            string currentUrl = urlsToVisit.front();
            urlsToVisit.pop();
            // Enqueue every unseen link that stays on the starting host.
            for (const string& newUrl : parser.getUrls(currentUrl)) {
                if (extractHost(newUrl) == rootHost && !visitedUrls.count(newUrl)) {
                    urlsToVisit.push(newUrl);
                    visitedUrls.insert(newUrl);
                }
            }
        }
        return vector<string>(visitedUrls.begin(), visitedUrls.end());
    }
};

The C++ code implements a web crawler that, starting from a given initial URL, uses breadth-first search to explore and collect every URL on the same host. Here's how the code works:

  1. Define an inner lambda that extracts the hostname from a URL. It finds the first '/' after the 7-character "http://" prefix and returns the substring between the prefix and that '/'. If no such '/' exists (e.g., http://news.example.com), find returns npos, so taking the min with url.size() clamps the cut point to the end of the string. Since the problem guarantees every link uses HTTP, hardcoding the 7-character offset is safe.

  2. Initialize a queue to manage URLs for visiting, and an unordered set to keep track of visited URLs. Add the initial URL to both collections.

  3. Calculate the host of the initial URL using the previously defined hostname extraction function.

  4. Enter a loop that continues as long as there are URLs left to visit:

    • Extract and remove the URL at the front of the queue.

    • Use the provided parser to fetch new URLs from the current URL.

    • For each new URL, check if its host matches the initial URL's host and if it has not been visited. If both conditions are met, add this URL to both the queue and visited URL set.

  5. Once the queue is empty, convert the set of visited URLs to a vector and return it.

This code collects every URL reachable from the initial URL that shares its hostname, ensuring that only relevant, local links are considered. The breadth-first approach explores nearby links before moving deeper into the link graph.

  • Java
java
class Solution {
    // The hostname is the piece at index 2 when splitting on '/': ["http:", "", host, ...].
    private String extractHost(String urlString) {
        return urlString.split("/")[2];
    }
    
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String rootHost = extractHost(startUrl);
        Queue<String> urlQueue = new LinkedList<>(Arrays.asList(startUrl)); // BFS frontier
        HashSet<String> seenUrls = new HashSet<>(Arrays.asList(startUrl));  // URLs already discovered
        while (!urlQueue.isEmpty()) {
            String currentUrl = urlQueue.poll();
            for (String neighborUrl : htmlParser.getUrls(currentUrl)) {
                if (extractHost(neighborUrl).equals(rootHost) && !seenUrls.contains(neighborUrl)) {
                    urlQueue.offer(neighborUrl);
                    seenUrls.add(neighborUrl);
                }
            }
        }
        return new ArrayList<>(seenUrls);
    }
}

The provided Java code implements a simple web crawler using breadth-first search (BFS). Here's how the solution works:

  • Host Extraction: A private method extractHost retrieves the hostname from a given URL by splitting the URL on '/' and taking index 2, the piece that follows the protocol's "//" (illustrated just after this list).
  • Crawler Initialization: The crawl method initializes with a starting URL. It employs a LinkedList as a queue (urlQueue) to manage the URLs to be visited and a HashSet (seenUrls) to keep track of visited URLs.
  • BFS Implementation: The while loop processes each URL in the queue:
    • It dequeues the current URL and retrieves all linked URLs (neighbors) using the htmlParser.getUrls method.
    • For each neighboring URL, the method:
      • Checks if the neighbor's host matches the root's host using extractHost.
      • Ensures the URL hasn't been visited before.
      • If both conditions are met, the URL is added to both the queue for further exploration and the set of seen URLs.
  • Result Compilation: Once the queue is empty, the method converts seenUrls into a list and returns it, representing all the URLs on the same host that can be reached from the start URL.
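
To see why index 2 is the hostname, consider what splitting on '/' produces. The demonstration below uses Python's str.split for brevity; Java's String.split yields the same pieces for these inputs.

python
# Splitting on '/' breaks the URL at the protocol's "//", leaving the host at index 2.
print("http://news.example.com/news".split("/"))
# ['http:', '', 'news.example.com', 'news']

print("http://news.example.com".split("/"))
# ['http:', '', 'news.example.com']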

This BFS-based approach ensures that all reachable URLs from the start URL within the same domain are visited without repetition, providing an efficient way to crawl a subset of the web defined by the starting host. The use of a queue and a set efficiently manages the expanding frontier of the crawl and the history of visited URLs respectively.

  • Python
python
import collections
from typing import List


class Solution:
    def web_crawler(self, initialUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        def extract_host(url):
            # The hostname is the piece at index 2 when splitting on '/': ['http:', '', host, ...].
            return url.split('/')[2]

        host_of_start = extract_host(initialUrl)
        queue = collections.deque([initialUrl])   # BFS frontier
        seen = {initialUrl}                       # every URL ever enqueued
        while queue:
            current_url = queue.popleft()
            for adjacent_url in htmlParser.getUrls(current_url):
                # Only follow links that stay on the starting host and have not been seen.
                if extract_host(adjacent_url) == host_of_start and adjacent_url not in seen:
                    queue.append(adjacent_url)
                    seen.add(adjacent_url)
        return list(seen)   # the annotation promises a list, so convert the set

The Python solution implements the same breadth-first crawler, traversing and collecting URLs starting from an initial URL:

  • Define a method web_crawler on the Solution class, which takes two parameters: initialUrl, the URL where the crawl begins, and htmlParser, an object that returns the URLs linked from a given webpage.
  • Define an inner function extract_host within web_crawler to extract the host part of the URL, aiding in determining if a URL belongs to the same domain as the initial URL.
  • Store the host of the initial URL in host_of_start.
  • Utilize a queue to manage URLs to be crawled, starting with the initialUrl.
  • Create a set seen to keep track of visited URLs, preventing the processing of the same URL multiple times.
  • Process URLs from the queue one by one in a loop:
    • Retrieve URLs linked on the current page using the htmlParser.getUrls method.
    • Filter and add URLs to the queue and seen set if they belong to the same host as the initial URL and have not been visited yet.
  • Return the contents of seen, converted to a list, containing all unique crawled URLs from the same host as the initial URL.

By employing collections.deque for queue management and a set to avoid revisiting URLs, the program crawls efficiently, following the standard breadth-first search pattern used in web crawlers.
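
To exercise the Python solution locally, one can wire up a hypothetical MockHtmlParser backed by the urls/edges data from Example 1. MockHtmlParser and the run below are illustrative only (the judge supplies the real HtmlParser) and assume the Solution class above is in scope.

python
import collections
from typing import List

class MockHtmlParser:
    """Hypothetical stand-in for HtmlParser, built from Example 1's urls and edges."""

    def __init__(self, urls: List[str], edges: List[List[int]]):
        self.links = collections.defaultdict(list)
        for src, dst in edges:
            self.links[urls[src]].append(urls[dst])

    def getUrls(self, url: str) -> List[str]:
        return self.links[url]

urls = [
    "http://news.example.com",
    "http://news.example.com/news",
    "http://news.example.com/news/topics/",
    "http://news.google.com",
    "http://news.example.com/us",
]
edges = [[2, 0], [2, 1], [3, 2], [3, 1], [0, 4]]

parser = MockHtmlParser(urls, edges)
result = Solution().web_crawler("http://news.example.com/news/topics/", parser)
print(sorted(result))
# ['http://news.example.com', 'http://news.example.com/news',
#  'http://news.example.com/news/topics/', 'http://news.example.com/us']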
