
Problem Statement
In this problem, you are provided with a string s
. Your task is to identify the longest section of the string that appears more than once, known as a duplicated substring. The substrings can be contiguous and their occurrences may overlap within the string.
The challenge is to find the longest such duplicated substring. If no such substrings exist (i.e., each character sequence appears only once), your function should return an empty string ""
.
Examples
Example 1
Input:
s = "banana"
Output:
"ana"
Example 2
Input:
s = "abcd"
Output:
""
Constraints
2 <= s.length <= 3 * 104
s
consists of lowercase English letters.
Approach and Intuition
To tackle the problem of finding the longest duplicated substring, let's consider the integral parts of our solution:
Understanding Overlapping Occurrences:
- Substrings can overlap, for instance, "ana" appears twice in "banana" with an overlapping character. This suggests that a simple scanning technique, while iterating, should account for overlaps correctly by restarting the search just after the beginning of the last found substring.
Brute Force Method:
- One possible approach is to generate all possible substrings of
s
and for each substring check if it appears more than once. However, this would be computationally expensive, especially with the maximum length ofs
being 30000.
- One possible approach is to generate all possible substrings of
Efficient Search Techniques:
- More effective methods like binary search combined with hashing (Rabin-Karp algorithm) could be utilized. The idea is to use binary search to guess the length of the longest duplicated substring, and then check if a substring of this length occurs more than once using a rolling hash.
Hashing for Quick Comparisons:
- Applying a hash function to string substrings allows for quick comparisons of these substrings in constant time, which can drastically reduce the search space and time complexity.
Edge Cases:
- With the minimum length of
s
being 2, the simplest case involves quickly checking if both characters are identical. - For strings not containing any repeated substrings, ensure the routine correctly returns an empty string.
- With the minimum length of
This tactical blend of binary search to determine the maximum length and hashing for efficient substring comparison forms a potent solution to identify the longest duplicated substring efficiently. Implementations must handle edge scenarios where the string length is at the boundaries of its constraints or where no duplicated substrings exist.
Solutions
- Java
- Python
class Solution {
String inputStr;
public int searchSubstr(int substrLen, int base, long mod, int strLen, int[] inputNums) {
long hash = 0;
for (int i = 0; i < substrLen; ++i) {
hash = (hash * base + inputNums[i]) % mod;
}
HashMap<Long, List<Integer>> hashStore = new HashMap<>();
hashStore.putIfAbsent(hash, new ArrayList<>());
hashStore.get(hash).add(0);
long baseL = 1;
for (int i = 1; i <= substrLen; ++i) {
baseL = (baseL * base) % mod;
}
for (int start = 1; start < strLen - substrLen + 1; ++start) {
hash = (hash * base - inputNums[start - 1] * baseL % mod + mod) % mod;
hash = (hash + inputNums[start + substrLen - 1]) % mod;
List<Integer> positions = hashStore.get(hash);
if (positions != null) {
String current = inputStr.substring(start, start + substrLen);
for (Integer pos : positions) {
String exist = inputStr.substring(pos, pos + substrLen);
if (exist.equals(current)) {
return pos;
}
}
}
hashStore.putIfAbsent(hash, new ArrayList<>());
hashStore.get(hash).add(start);
}
return -1;
}
public String findLongestRepeatedSubstring(String str) {
inputStr = str;
int len = inputStr.length();
int[] numList = new int[len];
for (int i = 0; i < len; ++i) {
numList[i] = (int)inputStr.charAt(i) - (int)'a';
}
int base = 26;
long mod = 1000000007;
int low = 1, high = len;
while (low <= high) {
int mid = low + (high - low) / 2;
if (searchSubstr(mid, base, mod, len, numList) != -1) {
low = mid + 1;
} else {
high = mid - 1;
}
}
int startIndex = searchSubstr(low - 1, base, mod, len, numList);
return inputStr.substring(startIndex, startIndex + low - 1);
}
}
This Java solution is designed to solve the problem of finding the longest duplicate substring in a given string. The approach taken here utilizes the Rabin-Karp algorithm, enhanced with binary search over the length of possible substrings, for efficiency.
Start by defining a helper function
searchSubstr
that searches for a substring of a specific length using a hash-based approach:- Calculate the hash value for each substring of the target length and store these in a hash table with starting indices.
- Move a sliding window across the string to update the hash iteratively and check for duplicates using the hash values.
The main method
findLongestRepeatedSubstring
utilizes the helper function as follows:- Convert each character of the string to an integer representation.
- Define a large prime number as the modulus to avoid hash overflow and choose 26 as the base, simulating a character set.
- Employ binary search to find the maximum length of the substring that appears more than once:
- Check substrings of middle length. If a duplicate exists, move to longer potential substrings. If not, try shorter lengths.
- After identifying the maximum substring length with duplicates, reapply the
searchSubstr
to get the exact starting index of the substring.
The solution finally returns the longest duplicated substring after calculating the appropriate starting index and substring length identified by the binary search process.
This technique is particularly effective in dealing with large input sizes due to its logarithmic search over lengths and linear check over the string with hash computations, making it fit for avoiding direct comparisons of potentially long substrings multiple times.
class StringAnalyzer:
def find_duplicate_substring(self, sub_length: int, base: int, hash_mod: int, total_length: int, char_value_list: List[int]) -> str:
# Initial hash calculation for the required substring length
substring_hash = 0
for i in range(sub_length):
substring_hash = (substring_hash * base + char_value_list[i]) % hash_mod
# Store hashes of seen substrings
observed_hashes = collections.defaultdict(list)
observed_hashes[substring_hash].append(0)
# Precompute base^sub_length % hash_mod
baseL = pow(base, sub_length, hash_mod)
for offset in range(1, total_length - sub_length + 1):
# Rolling hash calculation
substring_hash = (substring_hash * base - char_value_list[offset - 1] * baseL + char_value_list[offset + sub_length - 1]) % hash_mod
if substring_hash in observed_hashes:
current_substr = char_value_list[offset : offset + sub_length]
if any(current_substr == char_value_list[position : position + sub_length] for position in observed_hashes[substring_hash]):
return offset
observed_hashes[substring_hash].append(offset)
return -1
def longest_repeated_substring(self, input_text: str) -> str:
# Hash modulus
mod_value = 10**9 + 7
# Base for the hash
base_multiplier = 26
length = len(input_text)
# Converting characters to integer
char_values = [ord(input_text[i]) - ord('a') for i in range(length)]
# Binary search to find the maximum length of duplicated substring
best_start = -1
left_ptr, right_ptr = 1, length - 1
while left_ptr <= right_ptr:
mid = left_ptr + (right_ptr - left_ptr) // 2
duplicate_start = self.find_duplicate_substring(mid, base_multiplier, mod_value, length, char_values)
if duplicate_start != -1:
left_ptr = mid + 1
best_start = duplicate_start
else:
right_ptr = mid - 1
# Extract the longest duplicated substring using the start index found
return input_text[best_start : best_start + left_ptr - 1]
The provided Python code defines a class StringAnalyzer
that contains methods to identify the longest duplicated substring in a given string. Here's how the solution is structured:
- The method
find_duplicate_substring
is used to determine if there exists a duplicate substring of a givensub_length
. It employs a rolling hash mechanism and uses a binary search strategy to find duplicates efficiently. - The main method
longest_repeated_substring
initializes necessary variables such as the hash base and modulus, and it maps characters to integer values. This method applies a binary search to pinpoint the maximum length of a duplicated substring in the input text. - Hashes for substrings are stored and updated using the rolling hash formula to check for duplicates as the window slides over the input string. If a repeated substring hash is detected, it checks if the substrings at the stored positions match the current substring exactly.
- If a duplicate substring is found during the binary search, the left pointer is moved to the right to search for a longer duplicate substring, otherwise, the right pointer is moved to the left, reducing the search range.
- The method concludes by extracting and returning the longest repeated substring from the original text using the determined starting position and length, which is calculated from the binary search's left pointer.
This solution efficiently handles the search for the longest duplicate substring by combining hashing, rolling hash, and binary search, making it suitable for large strings while ensuring collision handling through substring comparison.
No comments yet.