
Problem Statement
In this task, you are given two strings, needle
and haystack
, and are required to determine the first position at which the substring needle
appears fully within the string haystack
. If the substring needle
cannot be located within haystack
at all, the function should return an index of -1
. This simple yet foundational task in string searching algorithms animates many real-world applications like text editing tools, search engines, and data processing systems, where such pattern matching is crucial for efficient operations.
Examples
Example 1
Input:
haystack = "sadbutsad", needle = "sad"
Output:
0
Explanation:
"sad" occurs at index 0 and 6. The first occurrence is at index 0, so we return 0.
Example 2
Input:
haystack = "vultrcode", needle = "vultro"
Output:
-1
Explanation:
"vultro" did not occur in "vultrcode", so we return -1.
Constraints
1 <= haystack.length, needle.length <= 104
haystack
andneedle
consist of only lowercase English characters.
Approach and Intuition
The problem of finding a substring within a string can be approached using multiple strategies. To flesh out the intuition and the applicable solution strategies based on the examples provided along with the constraints:
Brute Force Approach:
- Iterate over the
haystack
using a sliding window of size equal to the length ofneedle
. - For each position in
haystack
, compare the substring of length equal toneedle
toneedle
itself. - If a match occurs, that starting index is the desired result.
- If no such match is found throughout the loop, return
-1
.
- Iterate over the
Optimized Algorithms:
- Beyond the brute force method, more sophisticated string searching algorithms exist, such as the Knuth-Morris-Pratt algorithm, the Boyer-Moore algorithm, and the Rabin-Karp algorithm.
- These algorithms generally improve on the worst-case time complexity of the brute force approach by preprocessing
needle
or utilizing information gathered during the initial failed matches to skip sections ofhaystack
, thus reducing the number of comparisons necessary.
Examples from the problem statement can help understand this:
Example 1:
haystack = "sadbutsad", needle = "sad"
- The substring "sad" appears right at the beginning of "sadbutsad". The brute force method would identify "sad" at index 0 of "sadbutsad" without needing to check further substrings.
Example 2:
haystack = "vultrcode", needle = "vultro"
- Here, the substring "vultro" does not appear anywhere in "vultrcode". As we slide through "vultrcode", no substring matches "vultro". Therefore, the search concludes with
-1
.
Application of Constraints:
- Since the length of
haystack
andneedle
can go up to 104, efficiency matters, but the brute force method is often feasible for moderate sizes, particularly whenneedle
is small compared tohaystack
. - The constraint that both
haystack
andneedle
consist of only lowercase English characters simplifies the comparison process as there is no need to handle case sensitivity or a wider range of characters.
These points outline not just an approach but a clear plan on tackling this problem using straightforward programming techniques with considerations for efficiency improvements.
Solutions
- C++
- Java
- C
- JavaScript
- Python
class Solution {
public:
int findSubstring(string txt, string pat) {
int patLen = pat.size();
int txtLen = txt.size();
if (txtLen < patLen) return -1;
vector<int> lps(patLen);
int j = 0;
int i = 1;
while (i < patLen) {
if (pat[i] == pat[j]) {
j += 1;
lps[i] = j;
i += 1;
} else {
if (j == 0) {
lps[i] = 0;
i += 1;
} else {
j = lps[j - 1];
}
}
}
int txtIndex = 0;
int patIndex = 0;
while (txtIndex < txtLen) {
if (txt[txtIndex] == pat[patIndex]) {
patIndex += 1;
txtIndex += 1;
if (patIndex == patLen) {
return txtIndex - patLen;
}
} else {
if (patIndex == 0) {
txtIndex += 1;
} else {
patIndex = lps[patIndex - 1];
}
}
}
return -1;
}
};
This guide provides an explanation of a C++ solution method to determine the index of the first occurrence of a substring (pat) within a given string (txt). The solution employs the Knuth-Morris-Pratt (KMP) algorithm, which is efficient for this task because it skips unnecessary comparisons.
- The function
findSubstring
accepts two strings,txt
(the main string) andpat
(the pattern or substring to find). - Start by obtaining the lengths of both
txt
andpat
. Iftxt
is shorter thanpat
, return -1 as the substring cannot be present. - Prepare a vector
lps
to store the longest prefix suffix values for patternpat
. This vector is crucial for skipping unnecessary checks in case of a mismatch. - Initialize variables
i
andj
to manage the current position inpat
while filling thelps
array. Iterate throughpat
to calculate thelps
values. - With the
lps
array ready, use two pointers,txtIndex
andpatIndex
, initialized to zero. These pointers traverse throughtxt
andpat
, respectively. - If characters at
txt[txtIndex]
andpat[patIndex]
match, move both pointers forward. IfpatIndex
reaches the length ofpat
, a match has been found, and return the start index of this match intxt
. - Upon a mismatch, if
patIndex
is not zero, update it using the last knownlps
value (this represents the next position of the pattern to check). IfpatIndex
is zero, simply movetxtIndex
forward to continue checking. - If you exhaust
txt
and find no match, return -1.
The function finishes by returning either the starting index of the first match or -1 if no match is found, thus efficiently solving the problem using the KMP algorithm.
class Solution {
public int findSubstring(String haystack, String needle) {
int needleLen = needle.length();
int haystackLen = haystack.length();
if (haystackLen < needleLen) return -1;
// PREFIX TABLE COMPUTATION
int[] lps = new int[needleLen];
int len = 0;
int idx = 1;
while (idx < needleLen) {
if (needle.charAt(idx) == needle.charAt(len)) {
len += 1;
lps[idx] = len;
idx += 1;
} else {
if (len == 0) {
lps[idx] = 0;
idx += 1;
} else {
len = lps[len - 1];
}
}
}
// PATTERN SEARCHING
int hIndex = 0;
int nIndex = 0;
while (hIndex < haystackLen) {
if (haystack.charAt(hIndex) == needle.charAt(nIndex)) {
nIndex += 1;
hIndex += 1;
if (nIndex == needleLen) {
return hIndex - needleLen;
}
} else {
if (nIndex == 0) {
hIndex += 1;
} else {
nIndex = lps[nIndex - 1];
}
}
}
return -1;
}
}
This solution provides a method for finding the index of the first occurrence of one string (needle
) within another string (haystack
) using the Java language. The method, findSubstring
, explores the task using two main computational techniques—prefix computation and pattern searching—optimized by the Knuth-Morris-Pratt (KMP) string matching algorithm. Here's a summarized breakdown of how the solution operates:
Initialization: The function begins by determining the length of both
needle
andhaystack
. Ifneedle
is longer thanhaystack
, it immediately returns -1, indicating thatneedle
cannot be found withinhaystack
.Prefix Table Computation:
- Initializes an array
lps
that will store the length of the longest proper prefix which is also a suffix for each prefix ofneedle
. - Utilizes a while-loop to populate the
lps
array, which is critical in allowing the KMP algorithm to skip unnecessary comparisons.
- Initializes an array
Pattern Searching:
- Iteratively examines each character of
haystack
to check for the occurrence ofneedle
starting from that position. - Utilizes the
lps
table to skip redundant comparisons, thus improving search efficiency. - If the end of
needle
is reached during these comparisons, it calculates and returns the starting index ofneedle
inhaystack
. - If no match is found after checking all possible starting positions in
haystack
, it returns -1.
- Iteratively examines each character of
This approach, leveraging the efficiency of KMP, ensures a more optimal performance than a naive string searching method, especially for larger strings, by reducing the number of comparisons typically required. Use this implementation to efficiently find the starting index of a substring within a larger string in Java applications.
int findSubstring(char* mainStr, char* subStr) {
int subLen = strlen(subStr);
int mainLen = strlen(mainStr);
if (mainLen < subLen) return -1;
int border[subLen];
border[0] = 0;
int lps = 0;
int idx = 1;
while (idx < subLen) {
if (subStr[idx] == subStr[lps]) {
lps += 1;
border[idx] = lps;
idx += 1;
} else {
if (lps == 0) {
border[idx] = 0;
idx += 1;
} else {
lps = border[lps - 1];
}
}
}
int mainPtr = 0;
int subPtr = 0;
while (mainPtr < mainLen) {
if (mainStr[mainPtr] == subStr[subPtr]) {
subPtr += 1;
mainPtr += 1;
if (subPtr == subLen) {
return mainPtr - subLen;
}
} else {
if (subPtr == 0) {
mainPtr += 1;
} else {
subPtr = border[subPtr - 1];
}
}
}
return -1;
}
In the given C program, the task is to find the index of the first occurrence of a substring (subStr
) within a main string (mainStr
). This is achieved using an efficient string matching algorithm that incorporates the use of a border array to store the longest prefix suffix (LPS) values, which helps in skipping unnecessary comparisons.
- Start by calculating the lengths of both the main string and the substring. If the main string is shorter than the substring, the function immediately returns -1 as it's impossible for the substring to be found.
- Initialize the border array with zeros. The border array plays a crucial role in determining how many characters can be skipped during the string matching process.
- Construct the border array using a loop that compares the substring with itself, thereby determining the LPS for each position. The purpose of this step is to allow the algorithm to skip the matched portion of the substring when a mismatch occurs after some initial matches.
- Initialize two pointers,
mainPtr
andsubPtr
, representing positions within the main string and the substring respectively. - Iterate over the main string using a while loop. If the characters at
mainPtr
andsubPtr
are equal, both pointers are incremented. IfsubPtr
reaches the end ofsubStr
, the start index of the substring in the main string is computed and returned. - If a mismatch occurs and
subPtr
is not at the beginning ofsubStr
,subPtr
is updated to the respective LPS value or the next potential match position withinsubStr
. This allows the algorithm to bypass some comparisons, leveraging the previously computed LPS values. - Continue this process until the main string has been scanned completely or the substring is found.
- If the end of the main string is reached without finding the substring, return -1.
The approach optimizes the substring search by using the border array, making it efficient especially for cases where there are partial matches of large substrates in the main string, reducing the overall potential comparisons dramatically. This optimization is particularly effective in scenarios with repetitive character patterns in the substring.
/**
* @param {string} text
* @param {string} pattern
* @return {number}
*/
var findSubstring = function (text, pattern) {
let pLen = pattern.length;
let tLen = text.length;
if (tLen < pLen) return -1;
// PREPROCESSING
let lps = new Array(pLen);
lps[0] = 0;
let len = 0;
let index = 1;
while (index < pLen) {
if (pattern[index] == pattern[len]) {
len++;
lps[index] = len;
index++;
} else {
if (len == 0) {
lps[index] = 0;
index++;
} else {
len = lps[len - 1];
}
}
}
// MATCHING PROCESS
let txtIdx = 0;
let patIdx = 0;
while (txtIdx < tLen) {
if (text[txtIdx] == pattern[patIdx]) {
patIdx++;
txtIdx++;
if (patIdx == pLen) {
return txtIdx - pLen;
}
} else {
if (patIdx == 0) {
txtIdx++;
} else {
patIdx = lps[patIdx - 1];
}
}
}
return -1;
};
This solution outlines the process for finding the first occurrence of a substring (pattern) in a string (text) using a function written in JavaScript.
- Define a function
findSubstring
that takes two arguments,text
andpattern
. - Initialize variables
pLen
andtLen
to store the lengths ofpattern
andtext
, respectively. - If the text length is shorter than the pattern length, return
-1
as the pattern cannot fit inside the text. - Start by preprocessing the pattern to create the longest prefix which is also suffix (LPS) array. This array is essential for skipping unnecessary comparisons in the pattern during the matching process.
- Set up a loop for generating the LPS array using two pointers,
len
andindex
. Adjust these pointers based on comparisons between characters in the pattern. - After preprocessing, implement the actual pattern matching using two pointers
txtIdx
andpatIdx
for indexing the text and pattern respectively. - Traverse the string and compare characters sequentially. If characters match, increment both pointers. If a complete match of the pattern is found, return the start index of the match in the text.
- If a mismatch occurs, use the LPS array to backtrack
patIdx
without the need to incrementtxtIdx
, effectively skipping redundant comparisons. - If no match is found after traversing the complete text, return
-1
.
This algorithm efficiently reduces the number of comparisons needed and allows an early exit on successful matches, leveraging the preprocessing step for efficient backtracking during mismatch scenarios.
class Solution:
def findSubstring(self, main_string: str, sub_string: str) -> int:
length_sub = len(sub_string)
length_main = len(main_string)
if length_main < length_sub:
return -1
# Build longest prefix suffix array for pattern
lps = [0] * length_sub
len_lps = 0
index = 1
while index < length_sub:
if sub_string[index] == sub_string[len_lps]:
len_lps += 1
lps[index] = len_lps
index += 1
else:
if len_lps != 0:
len_lps = lps[len_lps - 1]
else:
lps[index] = 0
index += 1
# Search for the pattern in the text
i = 0 # index for main_string
j = 0 # index for sub_string
while i < length_main:
if sub_string[j] == main_string[i]:
i += 1
j += 1
if j == length_sub:
return i - j
elif i < length_main and sub_string[j] != main_string[i]:
if j != 0:
j = lps[j - 1]
else:
i += 1
return -1
The provided Python code defines a Solution class with a method findSubstring
that is designed to find the first occurrence of a substring (sub_string
) in a string (main_string
). The method returns the starting index of the substring if it is found, and -1 if it is not found.
Here’s how the code operates:
- First, determine the lengths of
sub_string
andmain_string
. Ifmain_string
is shorter thansub_string
, immediately return -1 as it is impossible forsub_string
to be present. - Use the Knuth-Morris-Pratt (KMP) algorithm to prepare the longest prefix suffix (LPS) array for the substring
sub_string
. This LPS array is crucial for optimizing the search by skipping unnecessary comparisons. - Iterate through
main_string
using a pointeri
and maintain a pointerj
forsub_string
. If characters frommain_string
andsub_string
match, increment both pointers. Oncej
matches the length ofsub_string
, a complete match has been found and return the starting index (i - j
). - If a character mismatch occurs after some matches (
j
> 0), use the LPS array to find the new position ofj
without rechecking previously matched characters. If no characters were matched (j
== 0), simply move thei
pointer. - If the end of
main_string
is reached without finding the substring, return -1.
To implement this:
- Instantiate the Solution class.
- Call the
findSubstring
method with requiredmain_string
andsub_string
as inputs. - Store the returned value to check the index of the first occurrence or determine if the substring is not present.
This method efficiently searches for a substring within a string using the precomputed LPS array to avoid redundant comparisons, thereby optimizing the search operation.
No comments yet.