Longest Duplicate Substring

Updated on 13 June, 2025
Longest Duplicate Substring header image

Problem Statement

In this problem, you are provided with a string s. Your task is to identify the longest section of the string that appears more than once, known as a duplicated substring. The substrings can be contiguous and their occurrences may overlap within the string.

The challenge is to find the longest such duplicated substring. If no such substrings exist (i.e., each character sequence appears only once), your function should return an empty string "".

Examples

Example 1

Input:

s = "banana"

Output:

"ana"

Example 2

Input:

s = "abcd"

Output:

""

Constraints

  • 2 <= s.length <= 3 * 104
  • s consists of lowercase English letters.

Approach and Intuition

To tackle the problem of finding the longest duplicated substring, let's consider the integral parts of our solution:

  1. Understanding Overlapping Occurrences:

    • Substrings can overlap, for instance, "ana" appears twice in "banana" with an overlapping character. This suggests that a simple scanning technique, while iterating, should account for overlaps correctly by restarting the search just after the beginning of the last found substring.
  2. Brute Force Method:

    • One possible approach is to generate all possible substrings of s and for each substring check if it appears more than once. However, this would be computationally expensive, especially with the maximum length of s being 30000.
  3. Efficient Search Techniques:

    • More effective methods like binary search combined with hashing (Rabin-Karp algorithm) could be utilized. The idea is to use binary search to guess the length of the longest duplicated substring, and then check if a substring of this length occurs more than once using a rolling hash.
  4. Hashing for Quick Comparisons:

    • Applying a hash function to string substrings allows for quick comparisons of these substrings in constant time, which can drastically reduce the search space and time complexity.
  5. Edge Cases:

    • With the minimum length of s being 2, the simplest case involves quickly checking if both characters are identical.
    • For strings not containing any repeated substrings, ensure the routine correctly returns an empty string.

This tactical blend of binary search to determine the maximum length and hashing for efficient substring comparison forms a potent solution to identify the longest duplicated substring efficiently. Implementations must handle edge scenarios where the string length is at the boundaries of its constraints or where no duplicated substrings exist.

Solutions

  • Java
  • Python
java
class Solution {
    String inputStr;

    public int searchSubstr(int substrLen, int base, long mod, int strLen, int[] inputNums) {
        long hash = 0;
        for (int i = 0; i < substrLen; ++i) {
            hash = (hash * base + inputNums[i]) % mod;
        }

        HashMap<Long, List<Integer>> hashStore = new HashMap<>();
        hashStore.putIfAbsent(hash, new ArrayList<>());
        hashStore.get(hash).add(0);

        long baseL = 1;
        for (int i = 1; i <= substrLen; ++i) {
            baseL = (baseL * base) % mod;
        }

        for (int start = 1; start < strLen - substrLen + 1; ++start) {
            hash = (hash * base - inputNums[start - 1] * baseL % mod + mod) % mod;
            hash = (hash + inputNums[start + substrLen - 1]) % mod;
            List<Integer> positions = hashStore.get(hash);
            if (positions != null) {
                String current = inputStr.substring(start, start + substrLen);
                for (Integer pos : positions) {
                    String exist = inputStr.substring(pos, pos + substrLen);
                    if (exist.equals(current)) {
                        return pos;
                    }
                }
            }
            hashStore.putIfAbsent(hash, new ArrayList<>());
            hashStore.get(hash).add(start);
        }
        return -1;
    }

    public String findLongestRepeatedSubstring(String str) {
        inputStr = str;
        int len = inputStr.length();
        
        int[] numList = new int[len];
        for (int i = 0; i < len; ++i) {
            numList[i] = (int)inputStr.charAt(i) - (int)'a';
        }

        int base = 26;
        long mod = 1000000007;

        int low = 1, high = len;
        while (low <= high) {
            int mid = low + (high - low) / 2;
            if (searchSubstr(mid, base, mod, len, numList) != -1) {
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }

        int startIndex = searchSubstr(low - 1, base, mod, len, numList);
        return inputStr.substring(startIndex, startIndex + low - 1);
    }
}

This Java solution is designed to solve the problem of finding the longest duplicate substring in a given string. The approach taken here utilizes the Rabin-Karp algorithm, enhanced with binary search over the length of possible substrings, for efficiency.

  • Start by defining a helper function searchSubstr that searches for a substring of a specific length using a hash-based approach:

    1. Calculate the hash value for each substring of the target length and store these in a hash table with starting indices.
    2. Move a sliding window across the string to update the hash iteratively and check for duplicates using the hash values.
  • The main method findLongestRepeatedSubstring utilizes the helper function as follows:

    1. Convert each character of the string to an integer representation.
    2. Define a large prime number as the modulus to avoid hash overflow and choose 26 as the base, simulating a character set.
    3. Employ binary search to find the maximum length of the substring that appears more than once:
      • Check substrings of middle length. If a duplicate exists, move to longer potential substrings. If not, try shorter lengths.
    4. After identifying the maximum substring length with duplicates, reapply the searchSubstr to get the exact starting index of the substring.
  • The solution finally returns the longest duplicated substring after calculating the appropriate starting index and substring length identified by the binary search process.

This technique is particularly effective in dealing with large input sizes due to its logarithmic search over lengths and linear check over the string with hash computations, making it fit for avoiding direct comparisons of potentially long substrings multiple times.

python
class StringAnalyzer:
    def find_duplicate_substring(self, sub_length: int, base: int, hash_mod: int, total_length: int, char_value_list: List[int]) -> str:
        # Initial hash calculation for the required substring length
        substring_hash = 0
        for i in range(sub_length):
            substring_hash = (substring_hash * base + char_value_list[i]) % hash_mod
              
        # Store hashes of seen substrings
        observed_hashes = collections.defaultdict(list)
        observed_hashes[substring_hash].append(0)
        
        # Precompute base^sub_length % hash_mod
        baseL = pow(base, sub_length, hash_mod)
        
        for offset in range(1, total_length - sub_length + 1):
            # Rolling hash calculation
            substring_hash = (substring_hash * base - char_value_list[offset - 1] * baseL + char_value_list[offset + sub_length - 1]) % hash_mod
            if substring_hash in observed_hashes:
                current_substr = char_value_list[offset : offset + sub_length]
                if any(current_substr == char_value_list[position : position + sub_length] for position in observed_hashes[substring_hash]):
                    return offset
            observed_hashes[substring_hash].append(offset)
        return -1
        
    def longest_repeated_substring(self, input_text: str) -> str:
        # Hash modulus
        mod_value = 10**9 + 7
        
        # Base for the hash
        base_multiplier = 26
        length = len(input_text)
        
        # Converting characters to integer
        char_values = [ord(input_text[i]) - ord('a') for i in range(length)]
        
        # Binary search to find the maximum length of duplicated substring
        best_start = -1
        left_ptr, right_ptr = 1, length - 1
        while left_ptr <= right_ptr:
            mid = left_ptr + (right_ptr - left_ptr) // 2
            duplicate_start = self.find_duplicate_substring(mid, base_multiplier, mod_value, length, char_values)
            
            if duplicate_start != -1:
                left_ptr = mid + 1
                best_start = duplicate_start
            else:
                right_ptr = mid - 1
        
        # Extract the longest duplicated substring using the start index found
        return input_text[best_start : best_start + left_ptr - 1]

The provided Python code defines a class StringAnalyzer that contains methods to identify the longest duplicated substring in a given string. Here's how the solution is structured:

  • The method find_duplicate_substring is used to determine if there exists a duplicate substring of a given sub_length. It employs a rolling hash mechanism and uses a binary search strategy to find duplicates efficiently.
  • The main method longest_repeated_substring initializes necessary variables such as the hash base and modulus, and it maps characters to integer values. This method applies a binary search to pinpoint the maximum length of a duplicated substring in the input text.
  • Hashes for substrings are stored and updated using the rolling hash formula to check for duplicates as the window slides over the input string. If a repeated substring hash is detected, it checks if the substrings at the stored positions match the current substring exactly.
  • If a duplicate substring is found during the binary search, the left pointer is moved to the right to search for a longer duplicate substring, otherwise, the right pointer is moved to the left, reducing the search range.
  • The method concludes by extracting and returning the longest repeated substring from the original text using the determined starting position and length, which is calculated from the binary search's left pointer.

This solution efficiently handles the search for the longest duplicate substring by combining hashing, rolling hash, and binary search, making it suitable for large strings while ensuring collision handling through substring comparison.

Comments

No comments yet.