
Problem Statement
In this problem, you are given an array of strings named words and an integer k. Your task is to determine the k most frequent strings in the array. The result must be ordered from highest to lowest frequency, and words with the same frequency must be sorted in lexicographical order.
Examples
Example 1
Input:
words = ["i","love","platform","i","love","coding"], k = 2
Output:
["i","love"]
Explanation:
"i" and "love" are the two most frequent words. Note that "i" comes before "love" due to a lower alphabetical order.
Example 2
Input:
words = ["the","day","is","sunny","the","the","the","sunny","is","is"], k = 4
Output:
["the","is","sunny","day"]
Explanation:
"the", "is", "sunny" and "day" are the four most frequent words, with the number of occurrences being 4, 3, 2 and 1 respectively.
Constraints
1 <= words.length <= 500
1 <= words[i].length <= 10
words[i] consists of lowercase English letters.
k is in the range [1, number of unique words in words].
Approach and Intuition
To solve the problem of finding the k most frequent words, sorted by frequency and then by lexicographical order, we can break the task into manageable steps:
- Frequency Calculation: First, calculate the frequency of each word in the words array. This is done efficiently with a dictionary where the word is the key and its number of occurrences is the value.
- Sorting by Frequency: Next, sort the frequency dictionary, primarily by frequency in descending order and secondarily (for words with the same frequency) by lexicographical order in ascending order.
- Extracting Results: Finally, take the first k entries from the sorted structure.
These steps address the problem constraints effectively while taking advantage of Python's built-in data structures, such as lists and dictionaries.
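The three steps above can be sketched directly in Python with a Counter and a single sorted call (a minimal illustration; the function name is mine, not taken from the solutions below):

```python
from collections import Counter

def top_k_frequent(words, k):
    counts = Counter(words)
    # Primary key: frequency descending (negated count); secondary key: word ascending.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return ordered[:k]

print(top_k_frequent(["i", "love", "platform", "i", "love", "coding"], 2))
# ['i', 'love']
```

Sorting on the compound key (-count, word) handles both criteria in one pass, because Python compares tuples element by element.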
Solutions
- C++
- Java
- Python
#include <map>
#include <string>
#include <vector>
using namespace std;

class Solution {
public:
    int limit;

    struct Node {
        Node* child[26];
        bool isEnd;
        Node() {
            isEnd = false;
            for (int j = 0; j < 26; ++j) {
                child[j] = nullptr;
            }
        }
    };

    // Insert a word into the trie rooted at `node`, one character per level.
    void insertWord(Node& node, const string& element) {
        Node* current = &node;
        for (char ch : element) {
            int index = ch - 'a';
            if (current->child[index] == nullptr) {
                current->child[index] = new Node();
            }
            current = current->child[index];
        }
        current->isEnd = true;
    }

    // Depth-first traversal: visiting children in 'a'..'z' order yields words
    // in lexicographical order; stop once `limit` words have been collected.
    void gatherWords(Node& node, const string& lead, vector<string>& output) {
        if (limit == 0) {
            return;
        }
        if (node.isEnd) {
            limit--;
            output.emplace_back(lead);
        }
        for (int i = 0; i < 26; ++i) {
            if (node.child[i] != nullptr) {
                gatherWords(*node.child[i], lead + char('a' + i), output);
            }
        }
    }

    vector<string> topKFrequent(vector<string>& elements, int k) {
        int size = elements.size();
        this->limit = k;
        map<string, int> frequency;
        for (string& element : elements) {
            frequency[element]++;
        }
        vector<string> output;
        // tier[f] is the root of a trie holding all words that occur f times.
        vector<Node> tier(size + 1, Node());
        for (auto& pair : frequency) {
            insertWord(tier[pair.second], pair.first);
        }
        // Walk from the highest possible frequency down, collecting words.
        for (int i = size; i > 0; i--) {
            gatherWords(tier[i], "", output);
        }
        return output;
    }
};
The provided C++ code offers an optimized solution for finding the top K frequent words from a given list of words. This is achieved through the implementation of trie (prefix tree) data structures combined with sorting strategies based on word frequencies.
The solution process involves several key components and steps:
- Node Structure Definition: Each trie node contains an array of 26 child pointers, one per character from 'a' to 'z', and a boolean flag marking the end of a valid word. Every child is initially set to nullptr.
- Word Insertion in Trie: The insertWord function inserts each word into the trie that corresponds to its frequency. It walks through each character of the word, computes its index, and descends to the matching child node, creating new nodes when necessary.
- Frequency Mapping: Before the tries are used, the function first maps each word to its number of occurrences using a map data structure.
- Array of Tries: A dynamic array of trie root nodes (vector<Node> tier) is created, where each slot represents one frequency count, allowing words to be stored in parallel by frequency.
- Gathering Output: The gatherWords function collects words starting from the trie with the highest frequency and working downward, so the top K frequencies are handled first. Words are added to the output vector until limit (which tracks how many of the k words remain to be collected) reaches zero.
- Result Compilation: Finally, the vector of strings containing the top K frequent words is returned.
By grouping words into per-frequency tries and scanning from the highest frequency first, the implementation extracts the most frequent words directly, avoiding a full sort of every word and reducing the overall work to the frequencies that matter.
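The per-frequency tier idea is independent of the trie itself. A simplified Python sketch (my own illustration, substituting a sorted list for each tier's trie) shows the bucket scan:

```python
from collections import Counter

def top_k_by_buckets(words, k):
    counts = Counter(words)
    n = len(words)
    # buckets[f] holds every word that occurs exactly f times.
    buckets = [[] for _ in range(n + 1)]
    for word, freq in counts.items():
        buckets[freq].append(word)
    result = []
    # Scan from the highest possible frequency downward; sort each bucket
    # lexicographically and stop as soon as k words have been collected.
    for freq in range(n, 0, -1):
        for word in sorted(buckets[freq]):
            result.append(word)
            if len(result) == k:
                return result
    return result

print(top_k_by_buckets(["the", "day", "is", "sunny", "the", "the", "the", "sunny", "is", "is"], 4))
# ['the', 'is', 'sunny', 'day']
```

The trie in the C++ solution plays the same role as `sorted(...)` here: it emits each bucket's words in lexicographical order.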
import java.util.*;

class Finder {
    private int threshold;
    private List<String> results;

    // Trie node: 26 children (one per lowercase letter) plus an end-of-word flag.
    class Node {
        Node[] next;
        boolean endOfWord;

        public Node() {
            next = new Node[26];
            endOfWord = false;
        }
    }

    public List<String> findTopKFrequent(String[] words, int threshold) {
        this.threshold = threshold;
        results = new ArrayList<>();
        int length = words.length;
        // tiers[f] is the root of a trie holding all words that occur f times.
        Node[] tiers = new Node[length + 1];
        Map<String, Integer> freqMap = new HashMap<>();
        for (String word : words) {
            freqMap.put(word, freqMap.getOrDefault(word, 0) + 1);
        }
        for (var entry : freqMap.entrySet()) {
            if (tiers[entry.getValue()] == null) {
                tiers[entry.getValue()] = new Node();
            }
            insertWord(tiers[entry.getValue()], entry.getKey());
        }
        // Scan from the highest possible frequency down; stop early once
        // the required number of words has been collected.
        for (int i = length; i > 0; i--) {
            if (tiers[i] != null) {
                extractWords(tiers[i], "");
            }
            if (this.threshold == 0) {
                break;
            }
        }
        return results;
    }

    private void insertWord(Node root, String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            if (current.next[c - 'a'] == null) {
                current.next[c - 'a'] = new Node();
            }
            current = current.next[c - 'a'];
        }
        current.endOfWord = true;
    }

    // DFS in 'a'..'z' order yields words in lexicographical order.
    private void extractWords(Node root, String currentWord) {
        if (threshold == 0) {
            return;
        }
        if (root.endOfWord) {
            threshold--;
            results.add(currentWord);
        }
        for (int i = 0; i < 26; i++) {
            if (root.next[i] != null) {
                extractWords(root.next[i], currentWord + (char) (i + 'a'));
            }
        }
    }
}
The provided Java code defines a method to find the top K frequent words from an array of words. The approach involves several steps, primarily focusing on efficient frequency counting and extracting the most frequent words using a trie-like data structure customized with frequency tiers. Here's the breakdown of the solution:
- A Finder class contains the methods for handling the operations and maintains state.
- A nested Node class represents each node in the trie, with 26 possible children (one per letter of the Latin alphabet) and a boolean flag indicating the end of a word.
- The findTopKFrequent method initializes the necessary data structures, including a frequency map and an array of Node objects, tiers, representing the different frequency levels.
- Word occurrences are first counted with a hash map.
- Each word is then inserted into the tier matching its frequency via the insertWord method, which incrementally builds that tier's trie.
- The method then checks each tier from the highest frequency downward, stopping early once the required number of results (threshold) has been collected, and uses the extractWords method to gather words from each trie.
- The extractWords method performs a DFS over the trie, collecting words and decrementing the threshold as they are added to the result list, so only the top K words are returned.
This trie-based approach efficiently groups words by frequency and gives quick access to the most frequent ones, keeping retrieval fast even for large or repetitive input word sets.
import string
from collections import Counter

class Solution:
    def frequentWords(self, input_words: list, limit: int) -> list:
        length = len(input_words)
        word_count = Counter(input_words)
        # sorted_data[f] is a trie (nested dicts) for words that occur f times.
        sorted_data = [{} for _ in range(length + 1)]
        self.limit = limit

        def insertWord(trie_structure, word):
            pointer = trie_structure
            for char in word:
                if char not in pointer:
                    pointer[char] = {}
                pointer = pointer[char]
            pointer['end'] = {}  # marker: a complete word ends here

        def collectWords(node, word_prefix):
            if self.limit == 0:
                return []
            collected = []
            if 'end' in node:
                self.limit -= 1
                collected.append(word_prefix)
            # Visiting children in 'a'..'z' order yields lexicographical order.
            for letter in string.ascii_lowercase:
                if letter in node:
                    collected += collectWords(node[letter], word_prefix + letter)
            return collected

        for word, frequency in word_count.items():
            insertWord(sorted_data[frequency], word)

        result = []
        # Scan the tries from the highest frequency down until `limit` words are found.
        for i in range(length, 0, -1):
            if self.limit == 0:
                return result
            if sorted_data[i]:
                result += collectWords(sorted_data[i], '')
        return result
The Python code provided solves the problem of finding the top K frequent words from a list of words. Here’s a concise summary of how the solution is implemented:
- The code uses the Python collections.Counter class to count how frequently each word appears in the input list, input_words.
- For each distinct frequency count, words are stored in a trie (prefix tree) built from nested dictionaries, where each key represents one character of a word.
- The insertWord function adds a word into the trie that matches its frequency count.
- The collectWords function recursively retrieves words from a trie in lexicographical order, stopping once the maximum number of results specified by limit has been reached.
- After the tries are built, the algorithm processes them from the highest frequency to the lowest, calling collectWords to extract words until the desired number of top-frequency words has been gathered.
The frequentWords method encapsulates this logic and returns the top K frequent words, ordered by frequency and then lexicographically, up to the specified limit. Combining a Counter with per-frequency tries lets the solution pick high-frequency words in sorted order efficiently.