Repeated DNA Sequences

Updated on 02 July, 2025
Repeated DNA Sequences header image

Problem Statement

In the context of studying genetic data, certain patterns in DNA sequences can be particularly telling. DNA sequences are made up of nucleotides, represented by the characters 'A', 'C', 'G', and 'T'. Analyzing these sequences for repeated patterns can contribute greatly to genetic research and medicine. In this task, you are provided with a string s, representing a DNA sequence, and you need to identify all sequences within s that are exactly 10 letters long and appear more than once.

Examples

Example 1

Input:

s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Output:

["AAAAACCCCC","CCCCCAAAAA"]

Example 2

Input:

s = "AAAAAAAAAAAAA"

Output:

["AAAAAAAAAA"]

Constraints

  • 1 <= s.length <= 105
  • s[i] is either 'A', 'C', 'G', or 'T'.

Approach and Intuition

Given the problem, where we need to find repeated sequences of a fixed length (10 letters) within a DNA string, we can adopt an efficient approach to solve this:

  1. Initialization: Begin by creating a dictionary to store sequences as keys and their counts as values. Also, prepare an empty list to store the result.

  2. Sliding Window Technique: Implement a sliding window of size 10 over the string s:

    • As you slide over the DNA sequence, extract each 10-letter-long substring.
    • Check if it exists in the dictionary:
      • If it is already there, increment its count.
      • If not, add it with a count of 1.
  3. Record Repeated Sequences: During the sliding window process, every time a substring's count changes from 1 to 2, add it to the results list. Only update the list at this transition to avoid duplicates in the result.

  4. Output the Results: Once the entire string has been scanned, return the list of repeated sequences.

This method leverages the fixed substring length to maintain manageable memory usage and quick lookup times, making it efficient even for large DNA sequences, thus adhering to the given constraints.

Solutions

  • C++
cpp
class Solution {
public:
    vector<string> collectRepeatedSequences(string dna) {
        int length = 10, sz = dna.size();
        if (sz <= length) return {};
    
        // Create number mapping for nucleotides
        unordered_map<char, int> charMap = {
            {'A', 0}, {'C', 1}, {'G', 2}, {'T', 3}};
        vector<int> dnaNums(sz);
        for (int i = 0; i < sz; ++i) dnaNums[i] = charMap[dna[i]];
    
        int hash = 0;
        unordered_set<int> seenHashes;
        unordered_set<string> repeatedSequences;
        // Iterate through each DNA sequence of length 'length'
        for (int start = 0; start < sz - length + 1; ++start) {
            // Calculate the rolling hash in constant time
            if (start != 0) {
                hash <<= 2;
                hash |= dnaNums[start + length - 1];
                hash &= ~(3 << 2 * length);
            }
            else {
                for (int i = 0; i < length; ++i) {
                    hash <<= 2;
                    hash |= dnaNums[i];
                }
            }
            // Add to results if the sequence is seen more than once
            if (seenHashes.find(hash) != seenHashes.end())
                repeatedSequences.insert(dna.substr(start, length));
            seenHashes.insert(hash);
        }
        return vector<string>(repeatedSequences.begin(), repeatedSequences.end());
    }
};

The provided C++ solution tackles the problem of identifying repeated DNA sequences of length 10 in a given DNA string. The approach employs a combination of hashing and rolling hash techniques for efficient detection of sequences that appear more than once.

Here's a succinct summary of how the solution operates:

  • Initialization: DNA nucleotides ('A', 'C', 'G', 'T') are mapped to integers (0, 1, 2, 3) for easy processing. A vector is then initialized to store these numerical representations of the DNA string.

  • Hash Setup: A variable for the rolling hash is initialized to zero. Two sets are used: one (seenHashes) to keep track of all distinct hashes (or sequences) encountered, and another (repeatedSequences) to store sequences that repeat.

  • Rolling Hash Calculation: The solution iteratively calculates the hash for each 10-character long substring, using a rolling hash technique to ensure efficient computation. This involves adjusting the hash by shifting it and incorporating the next character into the hash.

    • If the current sequence's hash is found in seenHashes, it suggests a repeat occurrence and is therefore added to the repeatedSequences set.
    • Otherwise, it's simply added to the seenHashes for future reference.
  • Result Compilation: Once all substrings are processed, those recorded in repeatedSequences are compiled into a vector and returned as the result.

This method efficiently reduces the computational complexity compared to naive approaches that might involve comparing each substring with all others, thereby providing a time-optimized solution for detecting repeated sequences in DNA strings. This makes the approach highly suitable for large genomic sequences where performance and speed are critical.

  • Java
java
class Solution {
    
    public List<String> extractRepeatedDnaSequences(String dna) {
        int sequenceLength = 10, totalLength = dna.length();
        if (totalLength <= sequenceLength) return new ArrayList();
    
        int baseVal = 4, baseL = (int) Math.pow(baseVal, sequenceLength);
    
        Map<Character, Integer> charMap = new HashMap() {
            {
                put('A', 0);
                put('C', 1);
                put('G', 2);
                put('T', 3);
            }
        };
        int[] converted = new int[totalLength];
        for (int i = 0; i < totalLength; ++i) converted[i] = charMap.get(dna.charAt(i));
    
        int hash = 0;
        Set<Integer> preSeen = new HashSet();
        Set<String> duplicates = new HashSet();
        for (int start = 0; start < totalLength - sequenceLength + 1; ++start) {
            if (start != 0) {
                hash <<= 2;
                hash |= converted[start + sequenceLength - 1];
                hash &= ~(3 << (2 * sequenceLength));
            }
            else {
                for (int i = 0; i < sequenceLength; ++i) {
                    hash <<= 2;
                    hash |= converted[i];
                }
            }
    
            if (preSeen.contains(hash)) duplicates.add(
                dna.substring(start, start + sequenceLength)
            );
            preSeen.add(hash);
        }
        return new ArrayList<String>(duplicates);
    }
}

This Java-based solution seeks to identify all the repeated DNA sequences in a given DNA string where each DNA sequence is exactly 10 characters long. The method extractRepeatedDnaSequences efficiently handles this task through a combination of hashing and sliding window techniques, ensuring optimal performance even for large input strings.

  • Begin by checking if the input DNA string is shorter than 10 characters. If so, return an empty list because no repeated sequences can exist.
  • Establish a hash map (charMap) to convert the DNA sequence's characters (A, C, G, T) into unique integers for easier manipulation.
  • Convert the DNA string into an array of these integers for swift computation.
  • Use a hashing approach to encode sequences into integers. Update the hash for each new sequence by removing the oldest character and adding the newest, simulating a sliding window of size 10 over the DNA string.
  • Track previously seen hashes using a Set. If any hash has been seen before, the corresponding DNA sequence is added to the results set.
  • Continue this process for the entire length of the DNA string.
  • Convert the set of duplicate sequences into a list and return it.

The design choice to use both a set and a hash ensures that even with the vast amount of data, the solution remains time and space-efficient by preventing repeated enumeration of duplicate sequences. This approach guarantees that sequence look-up and insertion operations are fast, allowing the algorithm to handle large inputs effectively.

  • Python
python
class Solution:
    def repeatedDNASequences(self, dna: str) -> List[str]:
        length, total_length = 10, len(dna)
        if total_length <= length:
            return []
    
        char_to_bit = {"A": 0, "C": 1, "G": 2, "T": 3}
        dna_bits = [char_to_bit.get(dna[index]) for index in range(total_length)]
    
        hash_value = 0
        visited, result = set(), set()
        for start_index in range(total_length - length + 1):
            if start_index != 0:
                hash_value <<= 2
                hash_value |= dna_bits[start_index + length - 1]
                hash_value &= ~(3 << 2 * length)
            else:
                for pos in range(length):
                    hash_value <<= 2
                    hash_value |= dna_bits[pos]
            if hash_value in visited:
                result.add(dna[start_index : start_index + length])
            visited.add(hash_value)
        return list(result)

The provided Python solution efficiently identifies all sequences of DNA that appear more than once in a given input string. The sequences have a length of exactly 10 nucleotides. Here's how the solution works:

  • Firstly, initial checks ensure that sequences shorter than 10 nucleotides immediately return an empty list, as no repeated sequences of length 10 can exist in such cases.
  • The solution maps each DNA character to a unique binary representation for more efficient hashing: A, C, G, and T to 0, 1, 2, and 3 respectively.
  • A rolling hash technique is used to generate a hash value for each 10-character window in the DNA string by traversing it only once. As the window slides, the hash value is updated by removing the bit representation of the oldest character and adding the new character's bits.
  • Sets are used for checking duplicates:
    • visited stores hash values of seen DNA sequences.
    • result collects sequences that appear more than once.
  • If a hash value for a current window exists in the visited set, the actual 10-character DNA sequence corresponding to this hash value is added to result.
  • Finally, the solution returns the sequences stored in result.

This approach significantly reduces the number of unnecessary string comparisons and leverages bitwise operations for maintaining and updating hash values, which are computationally efficient. The solution utilizes Python's advanced list comprehension and set operations to create an optimal system for detecting repeated DNA sequences.

Comments

No comments yet.