28. Find the Index of the First Occurrence in a String - Solutions and Explanation

Find the Index of the First Occurrence in a String header image

Problem Statement

In this task, you are given two strings, needle and haystack, and are required to determine the first position at which the substring needle appears fully within the string haystack. If the substring needle cannot be located within haystack at all, the function should return an index of -1. This simple yet foundational task in string searching algorithms animates many real-world applications like text editing tools, search engines, and data processing systems, where such pattern matching is crucial for efficient operations.

Examples

Example 1

Input:

haystack = "sadbutsad", needle = "sad"

Output:

Explanation:

"sad" occurs at index 0 and 6.
The first occurrence is at index 0, so we return 0.

Example 2

Input:

haystack = "vultrcode", needle = "vultro"

Output:

-1

Explanation:

"vultro" did not occur in "vultrcode", so we return -1.

Constraints

1 <= haystack.length, needle.length <= 104
haystack and needle consist of only lowercase English characters.

Approach and Intuition

The problem of finding a substring within a string can be approached using multiple strategies. To flesh out the intuition and the applicable solution strategies based on the examples provided along with the constraints:

Brute Force Approach:
- Iterate over the haystack using a sliding window of size equal to the length of needle.
- For each position in haystack, compare the substring of length equal to needle to needle itself.
- If a match occurs, that starting index is the desired result.
- If no such match is found throughout the loop, return -1.
Optimized Algorithms:
- Beyond the brute force method, more sophisticated string searching algorithms exist, such as the Knuth-Morris-Pratt algorithm, the Boyer-Moore algorithm, and the Rabin-Karp algorithm.
- These algorithms generally improve on the worst-case time complexity of the brute force approach by preprocessing needle or utilizing information gathered during the initial failed matches to skip sections of haystack, thus reducing the number of comparisons necessary.

Examples from the problem statement can help understand this:

Example 1:
- haystack = "sadbutsad", needle = "sad"
- The substring "sad" appears right at the beginning of "sadbutsad". The brute force method would identify "sad" at index 0 of "sadbutsad" without needing to check further substrings.
Example 2:
- haystack = "vultrcode", needle = "vultro"
- Here, the substring "vultro" does not appear anywhere in "vultrcode". As we slide through "vultrcode", no substring matches "vultro". Therefore, the search concludes with -1.

Application of Constraints:

Since the length of haystack and needle can go up to 104, efficiency matters, but the brute force method is often feasible for moderate sizes, particularly when needle is small compared to haystack.
The constraint that both haystack and needle consist of only lowercase English characters simplifies the comparison process as there is no need to handle case sensitivity or a wider range of characters.

These points outline not just an approach but a clear plan on tackling this problem using straightforward programming techniques with considerations for efficiency improvements.

Solutions

C++
Java
C
JavaScript
Python

                            cpp
                            
                        
class Solution {
public:
    int findSubstring(string txt, string pat) {
        int patLen = pat.size();
        int txtLen = txt.size();
    
        if (txtLen < patLen) return -1;
    
        vector<int> lps(patLen);
        int j = 0;
        int i = 1;
    
        while (i < patLen) {
            if (pat[i] == pat[j]) {
                j += 1;
                lps[i] = j;
                i += 1;
            } else {
                if (j == 0) {
                    lps[i] = 0;
                    i += 1;
                } else {
                    j = lps[j - 1];
                }
            }
        }
    
        int txtIndex = 0;
        int patIndex = 0;
    
        while (txtIndex < txtLen) {
            if (txt[txtIndex] == pat[patIndex]) {
                patIndex += 1;
                txtIndex += 1;
                if (patIndex == patLen) {
                    return txtIndex - patLen;
                }
            } else {
                if (patIndex == 0) {
                    txtIndex += 1;
                } else {
                    patIndex = lps[patIndex - 1];
                }
            }
        }
    
        return -1;
    }
};

This guide provides an explanation of a C++ solution method to determine the index of the first occurrence of a substring (pat) within a given string (txt). The solution employs the Knuth-Morris-Pratt (KMP) algorithm, which is efficient for this task because it skips unnecessary comparisons.

The function findSubstring accepts two strings, txt (the main string) and pat (the pattern or substring to find).
Start by obtaining the lengths of both txt and pat. If txt is shorter than pat, return -1 as the substring cannot be present.
Prepare a vector lps to store the longest prefix suffix values for pattern pat. This vector is crucial for skipping unnecessary checks in case of a mismatch.
Initialize variables i and j to manage the current position in pat while filling the lps array. Iterate through pat to calculate the lps values.
With the lps array ready, use two pointers, txtIndex and patIndex, initialized to zero. These pointers traverse through txt and pat, respectively.
If characters at txt[txtIndex] and pat[patIndex] match, move both pointers forward. If patIndex reaches the length of pat, a match has been found, and return the start index of this match in txt.
Upon a mismatch, if patIndex is not zero, update it using the last known lps value (this represents the next position of the pattern to check). If patIndex is zero, simply move txtIndex forward to continue checking.
If you exhaust txt and find no match, return -1.

The function finishes by returning either the starting index of the first match or -1 if no match is found, thus efficiently solving the problem using the KMP algorithm.

                            java
                            
                        
class Solution {
    public int findSubstring(String haystack, String needle) {
        int needleLen = needle.length();
        int haystackLen = haystack.length();
    
        if (haystackLen < needleLen) return -1;
    
        // PREFIX TABLE COMPUTATION
        int[] lps = new int[needleLen];
        int len = 0;
        int idx = 1;
    
        while (idx < needleLen) {
            if (needle.charAt(idx) == needle.charAt(len)) {
                len += 1;
                lps[idx] = len;
                idx += 1;
            } else {
                if (len == 0) {
                    lps[idx] = 0;
                    idx += 1;
                } else {
                    len = lps[len - 1];
                }
            }
        }
    
        // PATTERN SEARCHING
        int hIndex = 0;
        int nIndex = 0;
    
        while (hIndex < haystackLen) {
            if (haystack.charAt(hIndex) == needle.charAt(nIndex)) {
                nIndex += 1;
                hIndex += 1;
                if (nIndex == needleLen) {
                    return hIndex - needleLen;
                }
            } else {
                if (nIndex == 0) {
                    hIndex += 1;
                } else {
                    nIndex = lps[nIndex - 1];
                }
            }
        }
    
        return -1;
    }
}

This solution provides a method for finding the index of the first occurrence of one string (needle) within another string (haystack) using the Java language. The method, findSubstring, explores the task using two main computational techniques—prefix computation and pattern searching—optimized by the Knuth-Morris-Pratt (KMP) string matching algorithm. Here's a summarized breakdown of how the solution operates:

Initialization: The function begins by determining the length of both needle and haystack. If needle is longer than haystack, it immediately returns -1, indicating that needle cannot be found within haystack.
Prefix Table Computation:
- Initializes an array lps that will store the length of the longest proper prefix which is also a suffix for each prefix of needle.
- Utilizes a while-loop to populate the lps array, which is critical in allowing the KMP algorithm to skip unnecessary comparisons.
Pattern Searching:
- Iteratively examines each character of haystack to check for the occurrence of needle starting from that position.
- Utilizes the lps table to skip redundant comparisons, thus improving search efficiency.
- If the end of needle is reached during these comparisons, it calculates and returns the starting index of needle in haystack.
- If no match is found after checking all possible starting positions in haystack, it returns -1.

This approach, leveraging the efficiency of KMP, ensures a more optimal performance than a naive string searching method, especially for larger strings, by reducing the number of comparisons typically required. Use this implementation to efficiently find the starting index of a substring within a larger string in Java applications.

                            c
                            
                        
int findSubstring(char* mainStr, char* subStr) {
    int subLen = strlen(subStr);
    int mainLen = strlen(mainStr);
    
    if (mainLen < subLen) return -1;
    
    int border[subLen];
    border[0] = 0;
    int lps = 0;
    int idx = 1;
    
    while (idx < subLen) {
        if (subStr[idx] == subStr[lps]) {
            lps += 1;
            border[idx] = lps;
            idx += 1;
        } else {
            if (lps == 0) {
                border[idx] = 0;
                idx += 1;
            } else {
                lps = border[lps - 1];
            }
        }
    }
    
    int mainPtr = 0;
    int subPtr = 0;
    
    while (mainPtr < mainLen) {
        if (mainStr[mainPtr] == subStr[subPtr]) {
            subPtr += 1;
            mainPtr += 1;
            if (subPtr == subLen) {
                return mainPtr - subLen;
            }
        } else {
            if (subPtr == 0) {
                mainPtr += 1;
            } else {
                subPtr = border[subPtr - 1];
            }
        }
    }
    
    return -1;
}

In the given C program, the task is to find the index of the first occurrence of a substring (subStr) within a main string (mainStr). This is achieved using an efficient string matching algorithm that incorporates the use of a border array to store the longest prefix suffix (LPS) values, which helps in skipping unnecessary comparisons.

Start by calculating the lengths of both the main string and the substring. If the main string is shorter than the substring, the function immediately returns -1 as it's impossible for the substring to be found.
Initialize the border array with zeros. The border array plays a crucial role in determining how many characters can be skipped during the string matching process.
Construct the border array using a loop that compares the substring with itself, thereby determining the LPS for each position. The purpose of this step is to allow the algorithm to skip the matched portion of the substring when a mismatch occurs after some initial matches.
Initialize two pointers, mainPtr and subPtr, representing positions within the main string and the substring respectively.
Iterate over the main string using a while loop. If the characters at mainPtr and subPtr are equal, both pointers are incremented. If subPtr reaches the end of subStr, the start index of the substring in the main string is computed and returned.
If a mismatch occurs and subPtr is not at the beginning of subStr, subPtr is updated to the respective LPS value or the next potential match position within subStr. This allows the algorithm to bypass some comparisons, leveraging the previously computed LPS values.
Continue this process until the main string has been scanned completely or the substring is found.
If the end of the main string is reached without finding the substring, return -1.

The approach optimizes the substring search by using the border array, making it efficient especially for cases where there are partial matches of large substrates in the main string, reducing the overall potential comparisons dramatically. This optimization is particularly effective in scenarios with repetitive character patterns in the substring.

                            js
                            
                        
/**
 * @param {string} text
 * @param {string} pattern
 * @return {number}
 */
var findSubstring = function (text, pattern) {
    let pLen = pattern.length;
    let tLen = text.length;
    
    if (tLen < pLen) return -1;
    
    // PREPROCESSING
    let lps = new Array(pLen);
    lps[0] = 0;
    let len = 0;
    let index = 1;
    
    while (index < pLen) {
        if (pattern[index] == pattern[len]) {
            len++;
            lps[index] = len;
            index++;
        } else {
            if (len == 0) {
                lps[index] = 0;
                index++;
            } else {
                len = lps[len - 1];
            }
        }
    }
    
    // MATCHING PROCESS
    let txtIdx = 0;
    let patIdx = 0;
    
    while (txtIdx < tLen) {
        if (text[txtIdx] == pattern[patIdx]) {
            patIdx++;
            txtIdx++;
            if (patIdx == pLen) {
                return txtIdx - pLen;
            }
        } else {
            if (patIdx == 0) {
                txtIdx++;
            } else {
                patIdx = lps[patIdx - 1];
            }
        }
    }
    
    return -1;
};

This solution outlines the process for finding the first occurrence of a substring (pattern) in a string (text) using a function written in JavaScript.

Define a function findSubstring that takes two arguments, text and pattern.
Initialize variables pLen and tLen to store the lengths of pattern and text, respectively.
If the text length is shorter than the pattern length, return -1 as the pattern cannot fit inside the text.
Start by preprocessing the pattern to create the longest prefix which is also suffix (LPS) array. This array is essential for skipping unnecessary comparisons in the pattern during the matching process.
Set up a loop for generating the LPS array using two pointers, len and index. Adjust these pointers based on comparisons between characters in the pattern.
After preprocessing, implement the actual pattern matching using two pointers txtIdx and patIdx for indexing the text and pattern respectively.
Traverse the string and compare characters sequentially. If characters match, increment both pointers. If a complete match of the pattern is found, return the start index of the match in the text.
If a mismatch occurs, use the LPS array to backtrack patIdx without the need to increment txtIdx, effectively skipping redundant comparisons.
If no match is found after traversing the complete text, return -1.

This algorithm efficiently reduces the number of comparisons needed and allows an early exit on successful matches, leveraging the preprocessing step for efficient backtracking during mismatch scenarios.

                            python
                            
                        
class Solution:
    def findSubstring(self, main_string: str, sub_string: str) -> int:
        length_sub = len(sub_string)
        length_main = len(main_string)
    
        if length_main < length_sub:
            return -1
    
        # Build longest prefix suffix array for pattern
        lps = [0] * length_sub
        len_lps = 0
        index = 1
    
        while index < length_sub:
            if sub_string[index] == sub_string[len_lps]:
                len_lps += 1
                lps[index] = len_lps
                index += 1
            else:
                if len_lps != 0:
                    len_lps = lps[len_lps - 1]
                else:
                    lps[index] = 0
                    index += 1
    
        # Search for the pattern in the text
        i = 0  # index for main_string
        j = 0  # index for sub_string
    
        while i < length_main:
            if sub_string[j] == main_string[i]:
                i += 1
                j += 1
                
            if j == length_sub:
                return i - j
                
            elif i < length_main and sub_string[j] != main_string[i]:
                if j != 0:
                    j = lps[j - 1]
                else:
                    i += 1
    
        return -1

The provided Python code defines a Solution class with a method findSubstring that is designed to find the first occurrence of a substring (sub_string) in a string (main_string). The method returns the starting index of the substring if it is found, and -1 if it is not found.

Here’s how the code operates:

First, determine the lengths of sub_string and main_string. If main_string is shorter than sub_string, immediately return -1 as it is impossible for sub_string to be present.
Use the Knuth-Morris-Pratt (KMP) algorithm to prepare the longest prefix suffix (LPS) array for the substring sub_string. This LPS array is crucial for optimizing the search by skipping unnecessary comparisons.
Iterate through main_string using a pointer i and maintain a pointer j for sub_string. If characters from main_string and sub_string match, increment both pointers. Once j matches the length of sub_string, a complete match has been found and return the starting index (i - j).
If a character mismatch occurs after some matches (j > 0), use the LPS array to find the new position of j without rechecking previously matched characters. If no characters were matched (j == 0), simply move the i pointer.
If the end of main_string is reached without finding the substring, return -1.

To implement this:

Instantiate the Solution class.
Call the findSubstring method with required main_string and sub_string as inputs.
Store the returned value to check the index of the first occurrence or determine if the substring is not present.

This method efficiently searches for a substring within a string using the precomputed LPS array to avoid redundant comparisons, thereby optimizing the search operation.

Comments

No comments yet.

Find the Index of the First Occurrence in a String

Problem Statement

Examples

Example 1

Example 2

Constraints

Approach and Intuition

Solutions

Comments

Products

Features

Solutions

Marketplace

Resources

Company

Tech Talks

Vultr Blogs