Substring with Concatenation of All Words

Updated on 25 June, 2025
Substring with Concatenation of All Words header image

Problem Statement

In the given task, you are provided with a string s and an array of strings words. Each string in words is of equal length. You are required to determine all possible starting indices in the string s where a substring is formed by any permutation of concatenation of all strings in words. This specific kind of substring is termed as a "concatenated string." It is essential that the concatenated substring includes all strings from words in any order without omission or repetition.

For illustration, consider words = ["ab","cd","ef"]; valid concatenated strings in a scenario like this would include combinations such as "abcdef", "abefcd", and so forth, where each element of words appears exactly once in the substring. Conversely, strings like "acdbef" would not count since the order does not represent a direct permutation concatenation of words.

The result should be an array of integers indicating the starting indices of such valid concatenated substrings within s.

Examples

Example 1

Input:

s = "barfoothefoobarman", words = ["foo","bar"]

Output:

[0,9]

Explanation:

The substring starting at 0 is `"barfoo"`. It is the concatenation of `["bar","foo"]` which is a permutation of `words`.

Example 2

Input:

s = "wordgoodgoodgoodbestword", words = ["word","good","best","word"]

Output:

[]

Explanation:

There is no concatenated substring.

Example 3

Input:

s = "barfoofoobarthefoobarman", words = ["bar","foo","the"]

Output:

[6,9,12]

Explanation:

The substring starting at 6 is `"foobarthe"`. It is the concatenation of `["foo","bar","the"]`.

Constraints

  • 1 <= s.length <= 104
  • 1 <= words.length <= 5000
  • 1 <= words[i].length <= 30
  • s and words[i] consist of lowercase English letters.

Approach and Intuition

To solve this problem, consider the following steps:

  1. Verify Input Conditions: Check if the length of 's' or 'words' are within the allowed constraints that could pragmatically allow for a solution. For instance, if words is empty or the combined length of words exceeds s, there can't be any valid substrings.

  2. Calculate Parameters:

    • Determine the length of each word in words (noted as all words are of the same length).
    • Calculate the total length of a concatenated substring which would be a multiple of the individual word length and the total number of words.
  3. Sliding Window Technique: Use a sliding window approach in combination with a hashmap to track and verify counted words:

    • Slide across s with a window size equal to the calculated concatenated substring length.
    • Within each window, break the substring into segments of the word's length and verify against a hashmap which keeps a count of words needed.
    • Adjust the starting point of the sliding window based on whether the current substring segment aligns with any word in words.
  4. Matching and Collecting Indices:

    • If a window's segments exactly match the data structure outlining the required word counts (each word from words appears exactly as expected), record the starting index.
    • Continue this verification for each possible window in s.
  5. Return Results: Collect all valid indices into an array and return this as the solution.

This approach efficiently combines the fixed length of words elements and the possibility of permutation to validate substring sequences in a controlled manner. Implementing this with attention to window shifts and effective checking against expected word occurrences ensures that unnecessary checks are minimized, leveraging time complexity. Additionally, the use of hashmaps aids in quick lookups and comparisons, critical for performance given the potentially high constraints.

Solutions

  • C++
  • Java
  • C
  • JavaScript
  • Python
cpp
class Solution {
private:
    unordered_map<string, int> frequencyMap;
    int stringLength;
    int wordSize;
    int totalLength;
    int totalWords;
    void examineSubstring(int startPos, string &s, vector<int> &results) {
        unordered_map<string, int> currentCount;
        int validWords = 0;
        bool extraWord = false;
        for (int pos = startPos; pos <= stringLength - wordSize; pos += wordSize) {
            string tempSub = s.substr(pos, wordSize);
            if (!frequencyMap.count(tempSub)) {
                // Reset the window due to a non-required word
                currentCount.clear();
                validWords = 0;
                extraWord = false;
                startPos = pos + wordSize;
            } else {
                // Adjust window if necessary
                while (pos - startPos == totalLength || extraWord) {
                    string leftWord = s.substr(startPos, wordSize);
                    startPos += wordSize;
                    currentCount[leftWord]--;
                    if (currentCount[leftWord] >= frequencyMap[leftWord]) {
                        // Handle excessive instances of a word
                        extraWord = false;
                    } else {
                        // Handle a needed word
                        validWords--;
                    }
                }
                // Track appearances of the current word
                currentCount[tempSub]++;
                if (currentCount[tempSub] <= frequencyMap[tempSub]) {
                    validWords++;
                } else {
                    // More instances than needed
                    extraWord = true;
                }
                if (validWords == totalWords && !extraWord) {
                    // Valid substring found
                    results.push_back(startPos);
                }
            }
        }
    }

public:
    vector<int> findSubstring(string s, vector<string> &words) {
        stringLength = s.size();
        totalWords = words.size();
        wordSize = words[0].size();
        totalLength = wordSize * totalWords;
        for (string &word : words) {
            frequencyMap[word]++;
        }
        vector<int> results;
        for (int i = 0; i < wordSize; i++) {
            examineSubstring(i, s, results);
        }
        return results;
    }
};

This C++ solution provides an efficient implementation for finding all starting indices of a substring which is the concatenation of each word from a given list of words exactly once, without any intermediate characters. This implementation involves a single-pass sliding window technique combined with hashmaps for maintaining the count of words.

  • The primary idea is to compute the frequency of each word in the provided list using an unordered_map called frequencyMap.
  • Define stringLength, wordSize, totalLength, and totalWords for efficient management of substring checks throughout the main string s.
  • Use a helper function examineSubstring that utilizes two pointers technique. It processes the string by each word-length using a sliding window, checking every possible substring of length equal to the total combined length of the words.
  • In the sliding window:
    • Extract a word from the main string and assess its relevance by looking up frequencyMap.
    • Adjust the current count map if an extracted word doesn't align with the required frequency or is extra. It also resets the statistical count if irrelevant words intervene.
    • A vector results is then populated with start indices every time the accumulated valid words form a valid concatenation sequence without any extra words.
  • Finally, for each possible starting character (up to the length of a word), apply the sliding window technique to ensure all permutations caused by the word arrangement are covered.

Performance

The solution should perform at an efficiency close to O(n), where n is the length of the string, by minimizing redundant checks and not re-examining each word more than necessary. This optimal performance is achieved through the effective use of hashmaps and the two-pointer technique.

Usage

The findSubstring function provided can be directly used by passing the main string and the list of words. This function will return a vector containing the starting indices of the substrings in the main string that match the required concatenation pattern. This solution is particularly useful for parsing and complex string manipulation tasks involving patterns of words.

java
class Solution {
    private HashMap<String, Integer> frequencyMap = new HashMap<>();
    private int stringLength;
    private int wordSize;
    private int totalSize;
    private int totalWords;

    private void windowSearch(int start, String str, List<Integer> results) {
        HashMap<String, Integer> foundWords = new HashMap<>();
        int countUsed = 0;
        boolean overusedWord = false;

        for (int end = start; end <= stringLength - wordSize; end += wordSize) {
            String word = str.substring(end, end + wordSize);
            if (!frequencyMap.containsKey(word)) {
                foundWords.clear();
                countUsed = 0;
                overusedWord = false;
                start = end + wordSize;
            } else {
                while (end - start == totalSize || overusedWord) {
                    String firstWord = str.substring(start, start + wordSize);
                    start += wordSize;
                    foundWords.put(
                        firstWord,
                        foundWords.get(firstWord) - 1
                    );

                    if (foundWords.get(firstWord) < frequencyMap.get(firstWord)) {
                        countUsed--;
                    } else {
                        overusedWord = false;
                    }
                }

                foundWords.put(word, foundWords.getOrDefault(word, 0) + 1);
                if (foundWords.get(word) <= frequencyMap.get(word)) {
                    countUsed++;
                } else {
                    overusedWord = true;
                }

                if (countUsed == totalWords && !overusedWord) {
                    results.add(start);
                }
            }
        }
    }

    public List<Integer> findSubstring(String s, String[] words) {
        stringLength = s.length();
        totalWords = words.length;
        wordSize = words[0].length();
        totalSize = wordSize * totalWords;

        for (String word : words) {
            frequencyMap.put(word, frequencyMap.getOrDefault(word, 0) + 1);
        }

        List<Integer> results = new ArrayList<>();
        for (int i = 0; i < wordSize; i++) {
            windowSearch(i, s, results);
        }

        return results;
    }
}

The Java solution provided addresses the problem of finding substrings in s that are the concatenation of all words given in the array words. Each word in the array is of the same length.

The approach employs a sliding window technique along with hashmap to manage the word frequencies and check the constructed substrings effectively:

  • Initialize hashmaps for the target word frequencies (frequencyMap) and for counting word occurrences during the search (foundWords).
  • Calculate necessary metrics like length of s, number of words, size of each word, and the total size of the substring formed by concatenating all words.
  • For every valid starting position in s up to the length of a word (wordSize), employ windowSearch to find qualifying substrings.
    • In windowSearch, iterate over the string and extract each word of length wordSize.
    • If the word is not in frequencyMap, reset the search parameters. If the word is found but results in an overflow (overusedWord), adjust the starting position and update counters.
    • When the window of words matches both the content and frequency of the words in the frequencyMap without overflow, add the start index to results.

The result list, compiled during each sliding window operation, contains all starting indices of substrings in s that are a concatenation of all words in words.

The use of double hashmap and a sliding window technique ensures efficient checking and updates within the loop, allowing the method to handle cases with words of identical size and disregard order or overlaps directly. This method delivers a list of all valid starting indices for the desired substrings in s.

c
struct map_item {
    char identifier[1024];
    int value;
    UT_hash_handle hh;
};

void modify_value(struct map_item **hashmap, char *key, int delta) {
    struct map_item *item;
    HASH_FIND_STR(*hashmap, key, item);
    item->value += delta;
}

void insert_map(struct map_item **hashmap, char *key, int value) {
    struct map_item *item;
    HASH_FIND_STR(*hashmap, key, item);
    if (item == NULL) {
        item = (struct map_item *)malloc(sizeof(struct map_item));
        strncpy(item->identifier, key, strlen(key) + 1);
        HASH_ADD_STR(*hashmap, identifier, item);
    }
    item->value = value;
}

bool key_exists(struct map_item *hashmap, char *key) {
    struct map_item *item;
    HASH_FIND_STR(hashmap, key, item);
    return item != NULL;
}

int get_value(struct map_item *hashmap, char *key) {
    struct map_item *item;
    HASH_FIND_STR(hashmap, key, item);
    return item ? item->value : 0;
}

void clear_map(struct map_item **hashmap) {
    struct map_item *current, *tmp;
    HASH_ITER(hh, *hashmap, current, tmp) {
        HASH_DEL(*hashmap, current);
        free(current);
    }
    *hashmap = NULL;
}

int *find_substring(char *s, char **words, int wordsSize, int *returnSize) {
    int n = strlen(s);
    int k = wordsSize;
    int word_len = strlen(words[0]);
    int substr_size = word_len * k;
    struct map_item *word_count = NULL;

    for (int i = 0; i < wordsSize; ++i) {
        insert_map(&word_count, words[i], get_value(word_count, words[i]) + 1);
    }

    int *results = (int *)malloc(sizeof(int) * 0);
    int count = 0;

    for (int i = 0; i < word_len; i++) {
        struct map_item *words_found = NULL;
        int count_found = 0;
        bool overcount = false;
        char *temp_word = (char *)malloc(word_len + 1);
        temp_word[word_len] = '\0';

        int start = i;
        for (int end = i; end < n; end += word_len) {
            if (end + word_len > n) break;
            strncpy(temp_word, &s[end], word_len);

            if (!key_exists(word_count, temp_word)) {
                clear_map(&words_found);
                count_found = 0;
                overcount = false;
                start = end + word_len;
            } else {
                char *word_at_start = (char *)malloc(word_len + 1);
                word_at_start[word_len] = '\0';

                while (end - start == substr_size || overcount) {
                    strncpy(word_at_start, &s[start], word_len);
                    start += word_len;
                    modify_value(&words_found, word_at_start, -1);

                    if (get_value(words_found, word_at_start) ==
                        get_value(word_count, word_at_start)) {
                        overcount = false;
                    } else {
                        count_found -= 1;
                    }
                }
                insert_map(&words_found, temp_word, get_value(words_found, temp_word) + 1);

                if (get_value(words_found, temp_word) <=
                    get_value(word_count, temp_word)) {
                    count_found += 1;
                } else {
                    overcount = true;
                }

                if (count_found == k && !overcount) {
                    results = (int *)realloc(results, sizeof(int) * (++count));
                    results[count - 1] = start;
                }
            }
        }
    }
    *returnSize = count;
    return results;
}

This solution is designed to solve the problem of finding all the starting indices of substring(s) in a given string s that is a concatenation of each word in the words array exactly once and without any intervening characters.

Here's a brief breakdown of how the solution operates:

  • Hash Map Manipulation:

    • The provided code initializes custom hash maps (using uthash for operations in C) to store and track word frequencies and occurrences.
    • Key functions like insert_map, modify_value, and get_value are optimized for operations such as inserting a new word, modifying the count of an existing word, and fetching the count of a word, respectively.
  • Sliding Window over the String s:

    • The main function find_substring implements a sliding window technique to traverse the string s. A dynamic window size, based on the total length of all words in words multiplied by their count, is used.
    • The window slides over s in increments of the length of the words to match the pattern exactly.
  • Word Matching and Frequency Checking:

    • As the window slides, words in the current segment are compared against the words in the words list by looking them up in the hash map of expected word counts.
    • If a word does not exist in the hash map or exceeds the expected frequency (indicative of over counting), the sliding window resets, and word frequency counts are adjusted.
  • Index Storage and Return:

    • All starting indices where a valid concatenation starts are stored in an array results.
    • Finally, this array along with the count of these starting indices is returned.

This method ensures efficiency by avoiding unnecessary re-computation and directly leveraging hash map functionalities for constant-time operations. Despite being written in C, which requires manual memory management (as seen with the use of malloc and free), the solution handles dynamic collection sizes well, resizing where necessary, ensuring that the program remains efficient and robust during execution.

js
var locateSubstrings = function (str, wordList) {
    if (wordList.length === 0) return [];
    const wordFrequency = new Map();
    const lenWord = wordList[0].length;
    const targetLength = wordList.length * lenWord;
    const strLength = str.length;
    const matches = [];
    for (let word of wordList) {
        wordFrequency.set(word, (wordFrequency.get(word) || 0) + 1);
    }
    const checkWindow = function (start) {
        const foundWords = new Map();
        let countWords = 0;
        let overCount = false;
        for (let end = start; end <= strLength - lenWord; end += lenWord) {
            const piece = str.substring(end, end + lenWord);
            if (!wordFrequency.has(piece)) {
                foundWords.clear();
                countWords = 0;
                overCount = false;
                start = end + lenWord;
            } else {
                while (end - start === targetLength || overCount) {
                    const firstWord = str.substring(start, start + lenWord);
                    start += lenWord;
                    foundWords.set(
                        firstWord,
                        foundWords.get(firstWord) - 1,
                    );
                    if (
                        foundWords.get(firstWord) >=
                        wordFrequency.get(firstWord)
                    ) {
                        overCount = false;
                    } else {
                        countWords--;
                    }
                }
                foundWords.set(piece, (foundWords.get(piece) || 0) + 1);
                if (foundWords.get(piece) <= wordFrequency.get(piece)) {
                    countWords++;
                } else {
                    overCount = true;
                }
                if (countWords === wordList.length && !overCount) {
                    matches.push(start);
                }
            }
        }
    };
    for (let i = 0; i < lenWord; i++) {
        checkWindow(i);
    }
    return matches;
};

The provided JavaScript function locateSubstrings is designed to find all starting indices of substring(s) in a given string str that is a concatenation of all the words in the list wordList, where each word is used exactly once and without any intervening characters. The implementation approach is systematic and ensures that the solution is efficient and can handle variably sized inputs.

  • Initialize various parameters and controls including:

    • wordFrequency to store the frequency of each word in wordList.
    • lenWord to store the length of each word (assuming all words are the same length).
    • targetLength to determine the total length of the concatenated target substring formed from wordList.
    • matches to collect all valid starting indices of substrings in str.
  • The core logic to locate substrings utilizes a sliding window technique:

    1. Iterate through the string str in chunks the size of lenWord.
    2. Check each word-sized piece against wordFrequency to see if it matches any word in wordList.
    3. Manage and adjust a window of checked words using a map foundWords to count occurrences of each word and ensure they do not exceed their required frequency.
    4. Handle cases where an unwanted or excess word disrupts the current sequence by resetting and resizing the window accordingly.
    5. If a full series of wordList words are found in sequence without excess, record the starting index.
  • Optimization by checking windows:

    • Split the string check into as many windows as the length of the words to ensure full coverage of possible starting points in str.

The function finally returns an array of positions (matches), each indicating where a complete concatenation of wordList begins in the string str.

This solution efficiently handles word frequency and ensures the substring search is thorough by using a sliding window mechanism, adjusted dynamically based on content matches and mismatches found during iteration. This combines the advantages of map-based frequency counting and window sliding for effective substring searches in linear-time complexity relative to the size of str and wordList.

python
class Solution:
    def findSubstringIndices(self, string: str, wordlist: List[str]) -> List[int]:
        string_len = len(string)
        word_count = len(wordlist)
        single_word_len = len(wordlist[0])
        target_len = single_word_len * word_count
        word_freq = collections.Counter(wordlist)

        def check_segment(start):
            current_found = collections.defaultdict(int)
            words_collected = 0
            redundant_word = False

            for end in range(start, string_len, single_word_len):
                if end + single_word_len > string_len:
                    break

                word_segment = string[end : end + single_word_len]
                if word_segment not in word_freq:
                    current_found = collections.defaultdict(int)
                    words_collected = 0
                    redundant_word = False
                    start = end + single_word_len
                else:
                    while end - start == target_len or redundant_word:
                        left_word = string[start : start + single_word_len]
                        start += single_word_len
                        current_found[left_word] -= 1

                        if current_found[left_word] == word_freq[left_word]:
                            redundant_word = False
                        else:
                            words_collected -= 1

                    current_found[word_segment] += 1
                    if current_found[word_segment] <= word_freq[word_segment]:
                        words_collected += 1
                    else:
                        redundant_word = True

                    if words_collected == word_count and not redundant_word:
                        results.append(start)

        results = []
        for j in range(single_word_len):
            check_segment(j)

        return results

In the provided Python code, the function findSubstringIndices is designed to identify all starting indices in a given string where a concatenation of all the words from wordlist is found. Each word in wordlist is of the same length.

The implementation includes these steps:

  1. Calculate the needed lengths and count metrics: total string length (string_len), number of words (word_count), length of each word (single_word_len), combined length of all words in the list (target_len), and a frequency counter of the words (word_freq).

  2. The nested function check_segment examines sections of the string beginning at a specified start index. It uses a sliding window approach along with a hashed frequency count (current_found) to manage and compare current word segment counts against the required (word_freq).

  3. If any segment does not match word_freq, adjusts the starting point and resets any collected counts. If over-counting occurs, words are removed from the beginning of the current window until the counts align with the target frequencies.

  4. If a valid segment is discovered that matches word_freq and aligns with the target_len, its start index is saved.

  5. This checking process is carried out for each character in the word (single_word_len), ensuring full coverage of the string.

  6. The list of valid start indices (results) is updated through each complete iteration and ultimately returned.

The algorithm's efficiency is maintained by check_segment, handling dynamic updates and precise alignment with the target word frequencies. This solution effectively solves the problem by breaking down the complex checks and balances into manageable components and using dictionary and set operations for optimal lookup times.

Comments

No comments yet.