Tag Validator

Updated on 26 June, 2025

Problem Statement

Given a string code that represents an XML- or HTML-like snippet, implement a tag validator that decides whether the snippet is valid under a fixed set of rules. The code must be wrapped in a valid closed tag, meaning it starts with an opening tag and ends with the matching closing tag. An opening tag has the form <TAG_NAME> and a closing tag has the form </TAG_NAME>, where TAG_NAME contains only uppercase letters and is between 1 and 9 characters long.
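The TAG_NAME rule can be checked in isolation. The following is a minimal sketch, not part of the original solution; the class name TagNames and the method isValidTagName are hypothetical names chosen for illustration.

```java
final class TagNames {
    // Returns true when name is 1 to 9 characters long and every
    // character is an uppercase letter in the range A-Z.
    // (Hypothetical helper, introduced here for illustration only.)
    static boolean isValidTagName(String name) {
        if (name.isEmpty() || name.length() > 9) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c < 'A' || c > 'Z') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidTagName("DIV"));        // true
        System.out.println(isValidTagName("div"));        // false: lowercase letters
        System.out.println(isValidTagName("ABCDEFGHIJ")); // false: 10 characters
    }
}
```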

The content inside these tags, known as TAG_CONTENT, can include nested valid tags, CDATA sections, and any other characters, but it must not contain an unmatched <, unmatched opening or closing tags, or tags with an invalid TAG_NAME. A CDATA section has the form <![CDATA[CDATA_CONTENT]]>, where CDATA_CONTENT runs from <![CDATA[ up to the first subsequent ]]>. CDATA_CONTENT may contain any characters; the validator treats it as plain text and does not parse it, even if it looks like a tag.
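A short sketch of how a CDATA section can be skipped follows. It assumes the caller already holds the index where a section might start; the class name CdataScanner and the method skipCdata are hypothetical and do not appear in the solution further down.

```java
final class CdataScanner {
    private static final String OPEN = "<![CDATA[";
    private static final String CLOSE = "]]>";

    // If a CDATA section starts at index start, returns the index just
    // past its terminating "]]>". Returns -1 when no section starts
    // there or when the section is never terminated.
    // (Hypothetical helper, introduced here for illustration only.)
    static int skipCdata(String code, int start) {
        if (!code.startsWith(OPEN, start)) {
            return -1;
        }
        // The section ends at the FIRST "]]>" after the opening marker.
        int close = code.indexOf(CLOSE, start + OPEN.length());
        return close < 0 ? -1 : close + CLOSE.length();
    }

    public static void main(String[] args) {
        String s = "<![CDATA[<div>]>]]>]]>>]";
        int end = skipCdata(s, 0);
        // Prints the section itself: <![CDATA[<div>]>]]>
        System.out.println(s.substring(0, end));
    }
}
```

Because the search stops at the first ]]>, anything after it, including a second ]]>, belongs to the surrounding text rather than to the CDATA section.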

Validating a snippet therefore comes down to three checks: every tag is well formed, tags are properly nested and matched, and CDATA sections are skipped as plain text.

Examples

Example 1

Input:

code = "<DIV>This is the first line <![CDATA[<div>]]></DIV>"

Output:

true

Explanation:

The code is wrapped in a closed tag: <DIV> and </DIV>. The TAG_NAME is valid, and the TAG_CONTENT consists of some characters and a CDATA section. Although CDATA_CONTENT contains an unmatched start tag with an invalid TAG_NAME, it is treated as plain text and not parsed as a tag. So TAG_CONTENT is valid, the code is valid, and the result is true.

Example 2

Input:

code = "<DIV>>> ![cdata[]] <![CDATA[<div>]>]]>]]>>]</DIV>"

Output:

true

Explanation:

We first separate the code into: start_tag|tag_content|end_tag.

start_tag -> "<DIV>"
end_tag -> "</DIV>"

tag_content can in turn be separated into: text1|cdata|text2.

text1 -> ">> ![cdata[]] "
cdata -> "<![CDATA[<div>]>]]>", where the CDATA_CONTENT is "<div>]>"
text2 -> "]]>>]"

The reason why start_tag is NOT "<DIV>>>" is rule 6 of the original problem: once a < is found, the characters up to the next > are parsed as the TAG_NAME.
The reason why cdata is NOT "<![CDATA[<div>]>]]>]]>" is rule 7: CDATA_CONTENT ends at the first subsequent ]]>.

Example 3

Input:

code = "<A>  <B> </A>   </B>"

Output:

false

Explanation:

The tags are unbalanced. If <A> is closed, then <B> must be unmatched, and vice versa.

Constraints

  • 1 <= code.length <= 500
  • code consists of English letters, digits, '<', '>', '/', '!', '[', ']', '.', and ' '.

Approach and Intuition

  1. Understanding Valid Tags: A tag is considered valid if it opens and closes correctly with valid tag names, such as <DIV>content</DIV>. The tag name should be uppercase and of a length between 1 and 9 characters.

  2. Valid Tag Content: Content within a tag can also have nested tags, which themselves need to be valid. CDATA sections represented by <![CDATA[...]]> can hold any form of textual content, allowing characters usually restricted or parsed differently by XML/HTML parsers.

  3. Parsing the String: The main approach is to parse through the code string and use a stack to manage opening and closing tags. For each character, check if it starts a tag, ends a tag, or begins a CDATA section.

  4. Handling CDATA: If a <![CDATA[ is encountered, the subsequent characters till the first ]]> are collected as CDATA content and treated as regular text.

  5. Validation Using Stacks: A stack tracks the open tags. Each opening tag is pushed onto the stack and must later be closed by a matching closing tag, which pops it. The snippet is invalid if a closing tag does not match the most recent opening tag on the stack, if the stack becomes empty before the end of the string (content exists outside the wrapping tag), or if the stack is not empty after the whole string has been processed.
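The five steps above can be sketched as a single left-to-right pass. This is an illustrative alternative that does not use a regular expression and is not the solution presented later in this article; the class name StackTagValidator and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class StackTagValidator {
    // TAG_NAME rule: 1 to 9 uppercase letters.
    private static boolean isValidTagName(String name) {
        if (name.isEmpty() || name.length() > 9) return false;
        for (int k = 0; k < name.length(); k++) {
            char c = name.charAt(k);
            if (c < 'A' || c > 'Z') return false;
        }
        return true;
    }

    static boolean isValid(String code) {
        int n = code.length();
        // The code must begin with the wrapping opening tag.
        if (n == 0 || code.charAt(0) != '<') return false;

        Deque<String> open = new ArrayDeque<>(); // open tags, innermost first
        int i = 0;
        while (i < n) {
            // Characters remain after the wrapping tag was closed.
            if (i > 0 && open.isEmpty()) return false;

            if (code.startsWith("<![CDATA[", i)) {
                // CDATA must sit inside a tag; skip to just past the first "]]>".
                if (open.isEmpty()) return false;
                int close = code.indexOf("]]>", i + 9);
                if (close < 0) return false;
                i = close + 3;
            } else if (code.startsWith("</", i)) {
                // A closing tag must match the most recently opened tag.
                int close = code.indexOf('>', i + 2);
                if (close < 0) return false;
                String name = code.substring(i + 2, close);
                if (open.isEmpty() || !open.pop().equals(name)) return false;
                i = close + 1;
            } else if (code.charAt(i) == '<') {
                // An opening tag: everything up to the next '>' is the name.
                int close = code.indexOf('>', i + 1);
                if (close < 0) return false;
                String name = code.substring(i + 1, close);
                if (!isValidTagName(name)) return false;
                open.push(name);
                i = close + 1;
            } else {
                i++; // ordinary text inside a tag
            }
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isValid("<DIV>text <![CDATA[<div>]]></DIV>")); // true
        System.out.println(isValid("<A>  <B> </A>   </B>"));              // false
    }
}
```

Closing tag names are not validated separately here: a closing name is accepted only if it equals a name already on the stack, and only valid names are ever pushed.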

The three examples show how these rules play out:

  • Example 1 is valid: the wrapping tag encloses plain text and a CDATA section, and the tag-like text inside the CDATA section is not parsed.
  • Example 2 is also valid: it shows that a CDATA section ends at the first ]]>, and that stray > and ] characters in the surrounding text are allowed.
  • Example 3 is invalid because the tags are improperly nested: </A> appears while <B> is still open, so the closing tag does not match the most recent opening tag.

Solutions

  • Java
java
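import java.util.Stack;
import java.util.regex.Pattern;

public class Parser {
    // Names of the tags that are currently open, innermost on top.
    Stack<String> nameStack = new Stack<>();
    // Set once the first opening tag has been pushed.
    boolean hasTag = false;

    // Pushes an opening tag name, or pops the stack when a closing tag
    // matches the name on top. Returns false on a mismatched closing tag.
    public boolean checkTagName(String tagName, boolean isClosing) {
        if (isClosing) {
            if (!nameStack.isEmpty() && nameStack.peek().equals(tagName))
                nameStack.pop();
            else
                return false;
        } else {
            hasTag = true;
            nameStack.push(tagName);
        }
        return true;
    }

    public boolean validate(String htmlCode) {
        // Reset state so the same Parser can validate more than one string.
        nameStack.clear();
        hasTag = false;

        // Shape check: an opening tag, then any mix of text, tags with
        // 1 to 9 uppercase letters, and CDATA sections. Nesting is not
        // checked here; the stack pass below handles it.
        String regexPattern = "<[A-Z]{1,9}>([^<]*(<((\\/?[A-Z]{1,9}>)|(!\\[CDATA\\[(.*?)]]>)))?)*";
        if (!Pattern.matches(regexPattern, htmlCode))
            return false;
        for (int i = 0; i < htmlCode.length(); i++) {
            boolean isEndingTag = false;
            // The wrapping tag has already closed but characters remain.
            if (nameStack.isEmpty() && hasTag)
                return false;
            if (htmlCode.charAt(i) == '<') {
                // CDATA section: jump to its "]]>" and keep scanning from there.
                if (htmlCode.charAt(i + 1) == '!') {
                    i = htmlCode.indexOf("]]>", i + 1);
                    continue;
                }
                if (htmlCode.charAt(i + 1) == '/') {
                    i++;
                    isEndingTag = true;
                }
                // Everything up to the next '>' is the tag name.
                int tagClose = htmlCode.indexOf('>', i + 1);
                if (tagClose < 0 || !checkTagName(htmlCode.substring(i + 1, tagClose), isEndingTag))
                    return false;
                i = tagClose;
            }
        }
        // Every opened tag must have been closed.
        return nameStack.isEmpty();
    }
}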

The Java program defines a class Parser whose validate method checks whether an HTML-like string follows the rules above: the string must be wrapped in a single valid tag, and every tag inside it must open and close in the correct order.

The core elements of the parsing mechanism involve:

  • Using a Stack<String> named nameStack to track open tags, ensuring that tags are properly nested and closed in the order they were opened.

  • A hasTag boolean that records whether an opening tag has been seen. Together with an empty stack, it signals that the wrapping tag has already closed, so any remaining characters make the string invalid.

  • checkTagName method that deals with pushing tag names onto the stack when opening tags are identified and popping them when corresponding closing tags are found. The method returns false if a mismatch or improper closing is detected.

  • validate method which:

    1. Checks if the HTML-like string matches the general pattern for a well-formed segment using regular expressions.
    2. Iterates through the string character by character to discern and process tags, jumping indices appropriately to handle nested tags and CDATA sections. It leverages checkTagName for each detected tag.
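The split between the two steps matters because a regular expression of this shape cannot check nesting on its own. The sketch below runs only the pattern, mirroring the one passed to Pattern.matches in validate and written here with {1,9} for the opening tag name; the class name RegexPrefilter and the method looksWellFormed are hypothetical.

```java
import java.util.regex.Pattern;

final class RegexPrefilter {
    // Shape-only check: an opening tag followed by any mix of text,
    // opening or closing tags, and CDATA sections.
    private static final Pattern SHAPE = Pattern.compile(
        "<[A-Z]{1,9}>([^<]*(<((\\/?[A-Z]{1,9}>)|(!\\[CDATA\\[(.*?)]]>)))?)*");

    // (Hypothetical helper, introduced here for illustration only.)
    static boolean looksWellFormed(String code) {
        return SHAPE.matcher(code).matches();
    }

    public static void main(String[] args) {
        // Crossed nesting passes the shape check; only the stack catches it.
        System.out.println(looksWellFormed("<A><B></A></B>")); // true
        // An unclosed tag also passes the shape check.
        System.out.println(looksWellFormed("<A>text"));        // true
        // A lowercase tag name is rejected by the pattern itself.
        System.out.println(looksWellFormed("<a></a>"));        // false
    }
}
```

The pattern filters out malformed tag names and stray < characters, while mismatched or unclosed tags are left for the stack pass to reject.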

The validation process is stringent in that it verifies:

  • Correct nesting of tags using a stack data structure.
  • Absence of unmatched or dangling tags.
  • Overall structure against a defined regular expression pattern which encompasses rules for tag names and nested allowable content.

The regular expression rejects most malformed input before the stack pass runs, and the stack pass itself moves through the string from left to right without revisiting characters. The program is intended for strings with uppercase tag names of length 1 to 9, which makes it suitable for simplified HTML-like constructs or specific XML-based configurations.