1062. Longest Repeating Substring - Solutions and Explanation

Longest Repeating Substring header image

Problem Statement

The challenge presented requires determining the length of the longest repeating substring within a given string s. If s does not contain any repeating substrings, the function should return 0. A repeating substring can be described as a sequence of characters that appears more than once in the string, and the goal is to identify the length of the longest sequence of this nature.

Examples

Example 1

Input:

s = "abcd"

Output:

Explanation:

There is no repeating substring.

Example 2

Input:

s = "abbaba"

Output:

Explanation:

The longest repeating substrings are "ab" and "ba", each of which occurs twice.

Example 3

Input:

s = "aabcaabdaab"

Output:

Explanation:

The longest repeating substring is "aab", which occurs 3 times.

Constraints

1 <= s.length <= 2000
s consists of lowercase English letters.

Approach and Intuition

The problem of finding the longest repeating substring in a string combines elements of string manipulation and search techniques. From the provided examples and constraints, we can glean some insights and potential approaches:

A Brute Force approach would involve generating all possible substrings and checking for repetitions, which would be computationally expensive given s can be up to 2000 characters long.
Optimizations could include using binary search to find the "longest" feasible substring length, combined with some form of hashing or searching technique like the Rabin-Karp algorithm for efficient substring comparison.
As evident from the examples:
- For s = "abcd", there are no substrings that repeat, leading to a return of 0.
- For s = "abbaba", the longest repeating substrings are "ab" and "ba", each of two characters in length, hence the function returns 2.
- In s = "aabcaabdaab", the three-character substring "aab" repeats, and thus the function returns 3 as the length of the longest repeating substrings.
The constraints:
- The maximum length for s is 2000, suggesting that while a solution must be efficient, particularly in terms of time complexity, there is some processing power available.
- Only lowercase English letters will be present, simplifying the need to handle a broad charset, and allowing for more straightforward hashing or array index-based solutions.

Solutions

C++
Java
Python

                            cpp
                            
                        
class Solution {
public:
    int findLongestRepeatingSubstring(string s) {
        int n = s.size();
        vector<string> subs(n);
            
        // Store all suffixes of the input string
        for (int i = 0; i < n; i++) {
            subs[i] = s.substr(i);
        }
    
        // Apply MSD radix sort to the suffixes
        radixSort(subs);
    
        int maxLen = 0;
        // Find the longest common prefix between consecutive suffixes
        for (int i = 1; i < n; i++) {
            int cnt = 0;
            // Compare characters of adjacent suffixes
            while (cnt < min(subs[i].size(), subs[i - 1].size()) &&
                   subs[i][cnt] == subs[i - 1][cnt]) {
                cnt++;
            }
            // Update the maximum prefix length found
            maxLen = max(maxLen, cnt);
        }
        return maxLen;  // Return the length of the longest repeating substring
    }
    
private:
    void radixSort(vector<string>& arr) {
        vector<string> temp(arr.size());
        stableSort(arr, 0, arr.size() - 1, 0, temp);
    }
    
    void stableSort(vector<string>& arr, int low, int high, int depth,
                    vector<string>& temp) {
        if (low >= high) return;
    
        vector<int> count(28, 0);
        // Count characters at 'depth'
        for (int i = low; i <= high; i++) {
            count[charAt(arr[i], depth) + 1]++;
        }
        // Calculate positions
        for (int i = 1; i < 28; i++) {
            count[i] += count[i - 1];
        }
        // Distribute records
        for (int i = low; i <= high; i++) {
            temp[count[charAt(arr[i], depth)]++] = arr[i];
        }
        // Copy back to arr
        for (int i = low; i <= high; i++) {
            arr[i] = temp[i - low];
        }
        // Recursive subarray sorts
        for (int i = 0; i < 27; i++) {
            stableSort(arr, low + count[i], low + count[i + 1] - 1, depth + 1, temp);
        }
    }
    
    int charAt(const string& s, int d) {
        if (d >= s.size()) return 0;
        return s[d] - 'a' + 1;
    }
};

The solution for finding the longest repeating substring in C++ leverages suffix sorting and comparison. Here's a quick breakdown of the process:

Suffix Array Creation: Starts by generating all suffixes of the given string and storing them in an array.
Sorting: Uses a most significant digit (MSD) radix sort algorithm to lexicographically sort the suffixes of the string.
Common Prefix Length Measurement: Iterates over the sorted suffixes to identify the length of the longest common prefix between each pair of consecutive suffixes.
Result Calculation: Captures and returns the maximum length of these common prefixes as the length of the longest repeating substring.

Detailed Steps within the Functions:

findLongestRepeatingSubstring:
- Initializes an array to store the suffixes.
- Sorts the suffixes using the radix sort method.
- Calculates the longest repeating substring by comparing adjacent sorted suffixes.
radixSort:
- Utilizes the stableSort function recursively which organizes the suffixes by comparing characters incrementally, ensuring stability and lexicographic order.
stableSort:
- Counts occurrences of each character at the current depth in all suffixes using an auxiliary array.
- Sorts the suffixes based on these counts and recursively sorts the resultant buckets, increasing the depth each time to achieve complete sorting.

This approach efficiently handles the substring search by sorting and then simply comparing adjacent elements in the sorted array, rather than engaging in a more naive or brute force comparison of all possible substrings.

                            java
                            
                        
class Solution {
    
    public int longestRepeatedSubstring(String str) {
        int length = str.length();
        String[] suffixes = new String[length];
    
        // Generating suffix array
        for (int i = 0; i < length; i++) {
            suffixes[i] = str.substring(i);
        }
        // Implementing MSD Radix Sort on suffixes
        radixSortMSD(suffixes);
    
        int maxLcpLength = 0;
        // Identifying longest common prefix among sorted suffixes
        for (int i = 1; i < length; i++) {
            int lcp = 0;
            while (
                lcp < Math.min(suffixes[i].length(), suffixes[i - 1].length()) &&
                suffixes[i].charAt(lcp) == suffixes[i - 1].charAt(lcp)
            ) {
                lcp++;
            }
            maxLcpLength = Math.max(maxLcpLength, lcp);
        }
        return maxLcpLength;
    }
    
    // Method to apply MSD Radix Sort on suffix array
    private void radixSortMSD(String[] array) {
        msdSort(array, 0, array.length - 1, 0, new String[array.length]);
    }
    
    // Helper sort method for MSD radix sorting
    private void msdSort(String[] array, int lo, int hi, int d, String[] temp) {
        if (lo >= hi) return;
    
        int[] count = new int[28];
        for (int i = lo; i <= hi; i++) {
            count[charPosition(array[i], d) + 1]++;
        }
        for (int i = 1; i < 28; i++) {
            count[i] += count[i - 1];
        }
        for (int i = lo; i <= hi; i++) {
            temp[count[charPosition(array[i], d)]++] = array[i];
        }
        for (int i = lo; i <= hi; i++) {
            array[i] = temp[i - lo];
        }
        for (int i = 0; i < 27; i++) {
            msdSort(array, lo + count[i], lo + count[i + 1] - 1, d + 1, temp);
        }
    }
    
    // Fetch character position or return 0 for bounds
    private int charPosition(String s, int index) {
        if (index >= s.length()) return 0;
        return s.charAt(index) - 'a' + 1;
    }
}

In solving the problem of finding the length of the longest repeating substring in Java, the approach involves constructing a suffix array and implementing radix sort algorithms. Here's the breakdown:

Start by extracting each possible suffix from the given input string and developing a suffix array from these suffixes.
Employ the most significant digit (MSD) radix sort to organize the suffixes lexicographically. This sort handles sorting based on multiple characters of the strings.
Once sorted, identify the longest common prefix (LCP) among consecutive entries in the sorted suffix list. This reveals the longest repeating substring length by comparing adjacent suffixes and counting maximal identical initial segments.
Efficiently manage sorting using recursive helper methods that handle array partitions based on character positions, utilizing temporary arrays and counting distributions for sorting characters at specific positions.

By following these methods, the radix sort effectively groups substrings facilitating the efficient discovery of the longest recurring sequence within the string. The overall logic and flow remain clear, ensuring optimal performance and readability of the solution.

                            python
                            
                        
class Solution:
    def findLongestRepeatedSubstring(self, input_str: str) -> int:
        n = len(input_str)
        substrings = []
    
        # Construct array with all possible substrings
        for start in range(n):
            substrings.append(input_str[start:])
        # Sort these substrings efficiently
        self._sort_substrings(substrings)
    
        longest = 0
        # Determine longest repeated substring by comparing adjacent
        for idx in range(1, n):
            temp_length = 0
            # Compare character by character
            while (
                temp_length < min(len(substrings[idx]), len(substrings[idx - 1]))
                and substrings[idx][temp_length] == substrings[idx - 1][temp_length]
            ):
                temp_length += 1
            # Record the length of longest match
            longest = max(longest, temp_length)
        return longest
    
    def _sort_substrings(self, array: List[str]) -> None:
        temporary = ["" for _ in array]
        self._msd_sort_helper(array, 0, len(array) - 1, 0, temporary)
    
    def _msd_sort_helper(
        self, array: List[str], start: int, end: int, char_index: int, temp: List[str]
    ) -> None:
        if start >= end:
            return
    
        # Initialize character bucket sizes (incl. pseudo-EOF at index 0)
        buckets = [0] * 28
            
        # Count frequency of each character at the current depth
        for idx in range(start, end + 1):
            buckets[self._char_idx(array[idx], char_index) + 1] += 1
    
        # Accumulate bucket sizes to get positions
        for i in range(1, 28):
            buckets[i] += buckets[i - 1]
    
        # Move items into place in the temp array
        for idx in range(start, end + 1):
            position = self._char_idx(array[idx], char_index)
            temp[buckets[position]] = array[idx]
            buckets[position] += 1
    
        # Copy sorted items back into original array
        for idx in range(start, end + 1):
            array[idx] = temp[idx - start]
    
        # Recursively sort individual buckets
        for j in range(27):
            self._msd_sort_helper(
                array, start + buckets[j], start + buckets[j + 1] - 1, char_index + 1, temp
            )
    
    def _char_idx(self, s: str, index: int) -> int:
        # 0 for out-of-bounds (acts as string end marker)
        if index >= len(s):
            return 0
        return ord(s[index]) - ord("a") + 1

The Python3 program you're working with aims to identify the longest repeating substring within a given string by utilizing a sophisticated method based on comparison of suffixes and sorting. Follow these insights to understand the solution:

The central task of the program is to process an input string and identify the length of the longest repeating substring present in it. The main class Solution has two methods to accomplish this: findLongestRepeatedSubstring and _sort_substrings.
In findLongestRepeatedSubstring, an array substrings is created to hold all possible subsequent substrings of the input. These substrings are constructed in a loop and then sorted by invoking _sort_substrings.
The _sort_substrings method uses most significant digit (MSD) sorting, well-suited for string sorting, employing a temporary array and a character-index mapping.
After sorting, adjacent substrings are compared character by character to ascertain the length of the highest common prefix, which would indicate a repeating sequence. The longest such sequence length is tracked and finally returned by the program.
The detailed sorting mechanism involves first counting characters using a bucket array that effectively segregates substrings based on their current character. It essentially rearranges substrings according to character ranks at the indexed position, and then sorts within these categorizations recursively, deepening to successively higher character positions.

Understanding this solution leverages knowledge of suffix arrays, string sorting, and recursive subdivision of data, which enables efficient identification of repeating substrings amid large or complex datasets. This approach guarantees that the solution not only finds but also confirms the maximal repeating sequences in the string by direct comparison of ordered substring candidates.

Comments

No comments yet.

Longest Repeating Substring

Problem Statement

Examples

Example 1

Example 2

Example 3

Constraints

Approach and Intuition

Solutions

Comments

Products

Features

Solutions

Marketplace

Resources

Company

Tech Talks

Vultr Blogs