UTF-8 Validation

Updated on 04 July, 2025
UTF-8 Validation header image

Problem Statement

The objective is to determine if a sequence of integers correctly encodes a series of UTF-8 characters. UTF-8 character encoding allows for varied-length characters, where each character can span between 1 to 4 bytes. Each integer within the array represents a single byte of a character, and only the least 8 significant bits of each integer are relevant for our encoding evaluation.

We have to verify the sequence against specific UTF-8 encoding rules:

  • For a single-byte character, the leading bit is zero.
  • For a multi-byte (ranging from 2 to 4 bytes) character, the byte begins with consecutive ones followed by a zero, where the count of initial ones determines the byte's total count for this character. Every subsequent byte up until the expected count must start with '10' in binary.

Evaluating whether an array of integers forms a valid UTF-8 sequence involves checking these prefixes for each integer and ensuring continuation bytes are in the correct format and number.

Examples

Example 1

Input:

data = [197,130,1]

Output:

true

Explanation:

data represents the octet sequence: 11000101 10000010 00000001.
It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.

Example 2

Input:

data = [235,140,4]

Output:

false

Explanation:

data represented the octet sequence: 11101011 10001100 00000100.
The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character.
The next byte is a continuation byte which starts with 10 and that's correct.
But the second continuation byte does not start with 10, so it is invalid.

Constraints

  • 1 <= data.length <= 2 * 104
  • 0 <= data[i] <= 255

Approach and Intuition

Given the structure of UTF-8 encoding, determining if a list of integers is a valid UTF-8 encoding involves:

  1. Assessing each integer as if it's the start of a new character or a part of the byte sequence of a previous multi-byte character.

  2. For each integer, analyze the number of leading ones in its binary representation:

    • A single leading '0' means this byte declines a single-byte character.
    • '110' suggests the start of a two-byte character, requiring one following byte beginning with '10'.
    • '1110' suggests the start of a three-byte character, needing two subsequent bytes each starting with '10'.
    • '11110' signals a four-byte character, requiring three following bytes that start with '10'.
  3. Verify the continuity of expected bytes after the initial byte:

    • Check the number of two-leading-bit '10' bytes following a multi-byte starter aligns with initial expectations.
    • If at any point the sequence does not match the expected pattern, the function should return false.
  4. Continue this process until you validate every byte in the array.

By ensuring each sequence of bytes adheres to these rules, you can confirm the validity of the UTF-8 encoded sequence.

Example Demonstration

Consider these steps applied to example 1.

  • The input is data = [197, 130, 1]
  • Translating these into binary gives 11000101, 10000010, 00000001
  • First byte 11000101 starts with '110', signaling the start of a two-byte character.
  • The second byte 10000010 matches the expected '10' prefix for the subsequent byte of a two-byte character.
  • The third byte 00000001 starts with '0', and represents a valid single-byte character.

From the pattern analysis, the encoding in example 1 is correct. Conversely, in example 2, even though the starting pattern correctly predicts a three-byte character, one of the continuation bytes fails to match the expected '10' prefix, deeming the sequence invalid. This approach effectively checks adherence to encoding rules.

Solutions

  • Java
java
class Solution {
    public boolean isValidUtf8(int[] data) {
        // Required bytes for the current UTF-8 character
        int bytesToProcess = 0;
        // These masks verify the two most significant bits in each byte.
        int msb1 = 1 << 7;
        int msb2 = 1 << 6;
            
        for(int j = 0; j < data.length; j++) {
            // Begin processing a new UTF-8 character sequence.
            if (bytesToProcess == 0) {
                int leadMask = 1 << 7;
                while ((leadMask & data[j]) != 0) {
                    bytesToProcess += 1;
                    leadMask >>= 1;
                }
                // Single byte characters
                if (bytesToProcess == 0) {
                    continue;
                }
                // If number of bytes is greater than 4 or exactly 1 byte (which is invalid continuation scenario)
                if (bytesToProcess > 4 || bytesToProcess == 1) return false;
            } else {
                // Following bytes must have '10' as the two most significant bits.
                if (!((data[j] & msb1) != 0 && (data[j] & msb2) == 0)) {
                    return false;
                }
            }
            // Decrement the bytes required after checking a byte.
            bytesToProcess--;
        }
        // Ensure all bytes for the last character processed if data array ended in between a character
        return bytesToProcess == 0;
    }
}

This article provides a solution summary to validate if a given array of integers corresponds to a valid UTF-8 encoded string in Java.

The isValidUtf8 method initializes with declaring the number of bytes needed to process the current UTF-8 character (bytesToProcess). It also defines masks to check the two most significant bits (msb1 and msb2) of each byte for validation.

  • The primary loop iterates through every integer in the data array.
  • At the start of each new UTF-8 sequence determined by bytesToProcess being zero, it calculates how many leading bytes are required. This is done using a left shift operation on 1 until it aligns with the position of data[j]. Depending on the results, it increments bytesToProcess by one for each successful shift.
  • If the number of bytes calculated is either more than four or exactly one, the method returns false since it's not a valid UTF-8 leading byte sequence.
  • For other bytes in the sequence, the program checks if they conform to the '10' pattern of continuation bytes. If any byte does not match, the function returns false.
  • After all bytes of a UTF-8 sequence are examined, bytesToProcess is decremented.

Once the loop completes, the method checks if all expected bytes have been processed by ensuring bytesToProcess is zero. If yes, it means the entire sequence is valid; otherwise, it returns false.

This solution ensures efficient validation by directly analyzing the binary patterns that denote UTF-8 sequences, making it suitable for handling a broad array of UTF-8 encoding scenarios.

Comments

No comments yet.