Recently, I got curious about HTML parsing—you know, that thing browsers do billions of times a day that we all take completely for granted.
How hard could it be? I thought, like every programmer before me who has wandered into this particular circle of hell.
Turns out: very hard. HTML parsing isn’t just about recognizing <div> tags and calling it a day. It’s a complex mess of state machines, error recovery, and edge cases with countless bizarre scenarios.
The good news? HTML parsing is a solved problem, thoroughly documented in the WHATWG and W3C specifications. I went with the WHATWG HTML Living Standard [1] because it’s what all modern browsers actually implement, and it’s actively maintained. The WHATWG spec defines a parsing algorithm so intricate that implementing it correctly is genuinely challenging.
So naturally, I did it in Kotlin. Because why make life easy?
Note: This article is being written while the implementation is still ongoing, so some details and approaches may evolve as I continue learning and refining the parser.
Why HTML Parsing is Actually Terrible
Before diving into the implementation details, here’s what I discovered after spending way too many weekends reading the WHATWG spec:
The Fundamental Problem
HTML was designed to be written by humans, which means it was designed to be forgiving. Missing a closing tag? No problem! Forgot to escape that & character? We’ll figure it out! Nested a <p> inside another <p>? Sure, why not—we’ll just magically close the first one!
This sounds great in theory, but in practice, it means the parser has to be psychic. It needs to handle:
- The 80+ tokenization states: Each with its own special rules for what each character means
- The 20+ tree construction modes: Because apparently, parsing <script> content is different from parsing <style> content is different from parsing regular content
- Character entity madness: There are over 2,000 named character entities, plus numeric ones, plus special rules for when they’re malformed
- And much more: I haven’t even reached all the edge cases yet
Two-Phase Parsing
The WHATWG spec splits parsing into two phases, presumably to prevent implementers from going completely insane:
- Tokenization: Turn the character soup into discrete tokens
- Tree Construction: Turn those tokens into a document tree
This separation makes sense—tokenization handles all the character-level complexity, while tree construction handles all the structural complexity.
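Conceptually, the pipeline is just these two phases glued together. Here’s a minimal sketch of the wiring (the Tokenizer constructor and the nextToken/process names are my hypothetical glue, not code from the actual parser):

fun parse(html: String): TreeNode {
    val tokenizer = Tokenizer(html)
    val treeBuilder = TreeBuilder()
    while (true) {
        // Phase 1: tokenization turns the character soup into tokens.
        val token = tokenizer.nextToken() ?: break
        // Phase 2: tree construction hangs each token on the document tree.
        treeBuilder.process(token)
    }
    return treeBuilder.document
}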
Why I Chose Kotlin
Now, you might be wondering: why Kotlin? Isn’t this the kind of thing you’d normally write in C++ or Rust, languages that are designed to make you suffer in more interesting ways?
Here’s the thing: I only know Kotlin really well, and frankly, I didn’t want to suffer through learning another language’s syntax on top of implementing the complexity of an HTML parser. Fortunately, Kotlin turned out to be perfect for the job—it’s expressive enough that I can write state machines using DSLs that read almost like the English spec itself. It’s like having a magic translator that converts WHATWG bureaucratic prose directly into working code.
Let me show you what I mean:
Implementation Approach
The WHATWG spec defines parsing as a state machine with precise rules for each state transition. Here’s where Kotlin’s expressiveness really shines—I can model these states in a way that directly mirrors the specification.
Modeling Tokens
First, I need to represent the different token types the spec defines. Kotlin’s sealed interfaces handle this perfectly:
internal sealed interface Token {
    data class StartTag(
        val name: String,
        val selfClosing: Boolean = false,
        val attributes: MutableMap<String, String> = LinkedHashMap()
    ) : Token {
        fun addAttribute(name: String, value: String) {
            // Per the spec, duplicate attributes are ignored. Lowercase the
            // name once so the containsKey check and the insert use the same key.
            val key = name.lowercase()
            if (!attributes.containsKey(key)) {
                attributes[key] = value
            }
        }
    }

    data object DocType : Token
    data class EndTag(val name: String) : Token
    data class Comment(val data: String) : Token
    data class Character(val data: Char) : Token
}
The type system ensures I can’t accidentally mix up token types, and Kotlin’s when expressions force me to handle every possible case.
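For instance, a toy helper like this (my example, not part of the parser) won’t compile if any Token subtype is left unhandled:

// Drop any branch and the compiler flags the `when` as non-exhaustive.
fun describe(token: Token): String = when (token) {
    is Token.StartTag -> "<${token.name}>"
    is Token.EndTag -> "</${token.name}>"
    is Token.Comment -> "<!--${token.data}-->"
    is Token.Character -> token.data.toString()
    Token.DocType -> "<!doctype html>"
}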
From Spec to Code
Here’s what makes Kotlin particularly good for this: I can translate the WHATWG prose almost directly into code. Take the “Character reference state” [2] from section 13.2.5.72 of the spec:
Set the temporary buffer to the empty string. Append a U+0026 AMPERSAND (&) character to the temporary buffer. Consume the next input character:
- ASCII alphanumeric: Reconsume in the named character reference state.
- U+0023 NUMBER SIGN (#): Append the current input character to the temporary buffer. Switch to the numeric character reference state.
- Anything else: Flush code points consumed as a character reference. Reconsume in the return state.
This translates almost word-for-word into Kotlin:
internal class CharacterReference(
    private val returnState: TokenizationState,
) : TokenizationState {
    override fun consume(codePoint: Int, tokenizer: Tokenizer): StateTransition {
        // "Set the temporary buffer to the empty string. Append a U+0026
        // AMPERSAND (&) character to the temporary buffer."
        tokenizer.tempBuffer.clear()
        tokenizer.tempBuffer.append(Chars.Ampersand.char)
        return when {
            codePoint.isAsciiAlphaNumeric() -> reconsumeCurrentChar {
                NamedCharacterReference(returnState)
            }
            codePoint.IS(Chars.NumberSign) -> {
                tokenizer.tempBuffer.append(codePoint.toChar())
                consumeNextChar { NumericCharacterReference(returnState) }
            }
            else -> {
                returnState.flushCodePoints(tokenizer)
                reconsumeCurrentChar { returnState }
            }
        }
    }
}
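The reconsumeCurrentChar and consumeNextChar helpers are the DSL doing the heavy lifting. Their exact shape isn’t shown here, but a plausible sketch is tiny (the names match the snippet above; the rest is my guess at the plumbing):

// A transition tells the tokenizer which state comes next and whether
// the current character should be handled again by that state.
internal sealed interface StateTransition {
    val nextState: TokenizationState
    data class Reconsume(override val nextState: TokenizationState) : StateTransition
    data class Advance(override val nextState: TokenizationState) : StateTransition
}

// Re-handle the current input character in the next state.
internal inline fun reconsumeCurrentChar(next: () -> TokenizationState): StateTransition =
    StateTransition.Reconsume(next())

// Move to the next input character before entering the next state.
internal inline fun consumeNextChar(next: () -> TokenizationState): StateTransition =
    StateTransition.Advance(next())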
Current Implementation Progress
I’ve made significant progress on the tokenizer, implementing about 24 tokenization states so far. The parser can handle:
Core Tokenization:
- Basic tag parsing (open/close tags, tag names, attributes)
- Character references (both named entities like &amp; and numeric ones like &#65;)
- Self-closing tags
- Attribute parsing (quoted, unquoted, with proper error handling)
Tree Construction:
- All major insertion modes (InBody, InHead, InTable, etc.)
- Element stack management
- Basic DOM tree building
The implementation follows the WHATWG spec closely, but I’m being selective about which edge cases to implement. I’m not planning to handle every possible malformed HTML scenario—frankly, I’m not sure exactly how I’ll use this parser yet. Maybe as an agnostic parsing library for custom rendering, maybe for data extraction, definitely not for building a browser (I’m not that ambitious).
What’s Missing:
- Script and style content parsing (complex and I may not need them)
- Some of the more obscure tokenization states
- Full error recovery for severely malformed HTML
The core functionality works well enough to parse most real-world HTML documents. Whether I implement the remaining edge cases depends on what use cases emerge.
Character Entities: Where Sanity Goes to Die
Remember when I mentioned that there are over 2,000 named character entities? Well, buckle up, because this is where HTML parsing goes from “complicated” to “actively malicious.”
Think &amp; and &lt; are simple? Think again. The spec has special rules for:
- Ambiguous ampersands: What happens when you write &notit; and the entity table contains &not but nothing named &notit;?
- Missing semicolons: &amp without a semicolon is sometimes valid, but only in certain contexts
- Invalid numeric ranges: &#x110000; is outside the valid Unicode range, so it becomes the replacement character (a sketch of these fix-ups follows this list)
- Legacy compatibility: Some malformed entities are “fixed” for backwards compatibility with ancient HTML
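To make the numeric rules concrete, here’s a minimal sketch of the fix-ups the spec mandates before a numeric reference is emitted (my illustration of the rules, not the parser’s actual code; the spec additionally remaps 0x80–0x9F through a windows-1252 table, omitted here):

fun sanitizeNumericReference(code: Int): Int = when {
    code == 0x00 -> 0xFFFD             // NUL is never emitted
    code > 0x10FFFF -> 0xFFFD          // beyond the last Unicode code point
    code in 0xD800..0xDFFF -> 0xFFFD   // lone surrogates aren't characters
    else -> code
}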
Named Character References
Let’s say you’re parsing this HTML: <div title="He said &not today">. When you hit that ampersand, you need to figure out if it’s an entity. But here’s the problem: there are multiple entities that start with “not”:
- &not; → ¬ (logical NOT symbol)
- &notin; → ∉ (not an element of)
- &notni; → ∌ (does not contain as member)
The WHATWG spec says: be greedy. Consume as many characters as possible to form the longest valid entity. This is called “maximal munch” [3].
The Trie Structure
Instead of scanning through 2,000+ entities linearly, I built a trie keyed character by character.
Each node in the trie can serve multiple roles:
- Intermediate path nodes: Like “A” → “E” → “L” → “i”, which form part of longer entity names
- Terminal nodes with codepoints: Nodes that mark complete entities (like “AElig” → Æ)
- Both simultaneously: A node can be both a valid endpoint AND continue to longer entities (like “AMP” which is complete but can extend to “AMP;”)
This dual nature is crucial for the WHATWG spec’s “maximal munch” behavior. For example, when parsing &AMP;, the algorithm will:
- Find a valid match at “AMP” (stores this as a potential result)
- Continue and find an even better match at “AMP;” (updates the result)
- Return the longest valid match found
Both &AMP and &AMP; exist as separate entries in the NamedChars map, so both paths have their own codepoints arrays set in the trie nodes.
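A node needs surprisingly little state. Here’s a minimal sketch of the shape the traversal below relies on (the children and codepoints names match the walk code; the class itself is my simplification):

class TrieNode {
    // One child per next character in some entity name.
    val children: MutableMap<Char, TrieNode> = HashMap()
    // Non-null only when the path to this node spells a complete entity.
    var codepoints: IntArray? = null
}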
Following the Trail
Here’s how the algorithm works when parsing &not today:
var currentNode = EntityTrie.root
var consumedChars = ""
var lastMatch: Triple<String, IntArray, Int>? = null
while (true) {
    val char = inputBuffer[position] ?: break
    val nextNode = currentNode.children[char] ?: break
    currentNode = nextNode
    consumedChars += char
    position++
    // If this node has codepoints, it's a complete entity
    currentNode.codepoints?.let {
        lastMatch = Triple(consumedChars, it, position)
    }
}
Let’s trace through the steps: n, o, and t all have matching children, so we descend three levels. The node for “not” carries codepoints (&not is a legacy entity that’s valid even without a semicolon), so lastMatch records it. The space character has no child in the trie, the loop breaks, and the recorded match wins.
When Paths Get Longer
Here’s a more complex example showing how the algorithm handles invalid entities. If we parsed &notit; (note: this isn’t a real HTML entity, but &not is):
The key insight: we greedily consume characters as long as valid paths exist, but always remember the longest complete entity we’ve seen. In this case, even though &notit doesn’t exist as an entity, the algorithm falls back to &not → ¬ (the longest valid match it found) and leaves “it;” for further parsing, resulting in the string ¬it;.
This demonstrates how the implementation handles malformed or non-existent entities gracefully: it doesn’t fail completely but instead uses the best partial match available.
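Continuing the walk above, the post-match step might look something like this (emit, rewindTo, and emitRaw are hypothetical helpers standing in for the tokenizer’s real flush logic):

val match = lastMatch
if (match != null) {
    val (_, codepoints, endPosition) = match
    emit(codepoints)              // "&not" -> the ¬ code point
    rewindTo(endPosition)         // "it;" goes back to the input stream
} else {
    emitRaw("&" + consumedChars)  // nothing matched; keep the raw text
}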
Context-Sensitive Edge Cases
The spec has a weird rule for attribute values. Consider:
<a href="?foo=1&amp">  <!-- treat &amp as literal text -->
<a href="?foo=1&amp;"> <!-- decode &amp; to & -->
My implementation handles this context sensitivity:
if (returnState.isAttributeState() && !match.endsWith(';') &&
    (nextChar == '=' || nextChar?.isLetterOrDigit() == true)
) {
    // Legacy behavior: keep the raw text instead of decoding.
    // (The spec's check is "next char is = or ASCII alphanumeric".)
    tokenizer.currentAttributeValue.append(match)
}
This whole system works like following a GPS route—you keep driving as long as there’s a valid path ahead, but you remember the last landmark you passed in case you need to turn around.
Tree Construction: The Fun Isn’t Over Yet
Okay, so you’ve survived the tokenization phase. Your HTML has been chopped up into a nice stream of tokens. Surely the hard part is over, right?
Right?
Nope! Now we get to build a tree from those tokens, and this is where HTML’s “forgiving” nature really shines.
I’ve just started implementing the tree construction phase, and I’m already facing some decisions. The WHATWG spec defines not just how to build a tree, but also a full DOM API with all the bells and whistles. The question is: should I implement the entire DOM specification from scratch and go completely crazy, or keep it simple for now with a minimal tree structure?
For now, I’m taking the simple approach.
The Current Architecture
My tree builder follows the insertion mode pattern from the spec:
interface InsertionMode {
fun process(token: Token, treeBuilder: TreeBuilder)
}
class TreeBuilder {
private var insertionMode: InsertionMode = Initial
private val openElements = mutableListOf<TreeNode>()
val document: TreeNode = TreeNode.createDocument()
}
The tree construction algorithm has to handle some interesting scenarios:
- Implicit tag closing: <p>Hello<div>World</div> automatically closes the <p> before opening the <div> (a toy sketch of this follows the list)
- The stack of open elements: Track exactly which elements are open and in what order
- Multiple insertion modes: Initial, BeforeHTML, InHead, InBody, etc.—each with different rules
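Here’s a toy version of that first scenario (closeParagraphIfOpen is a hypothetical helper; the spec’s actual InBody rules cover far more tags and scope checks):

object InBody : InsertionMode {
    // Start tags that implicitly close an open <p> per the spec's InBody rules.
    private val closesParagraph = setOf("div", "p", "ul", "ol", "blockquote")

    override fun process(token: Token, treeBuilder: TreeBuilder) {
        if (token is Token.StartTag && token.name in closesParagraph) {
            treeBuilder.closeParagraphIfOpen()
        }
        // ...the rest of the InBody rules
    }
}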
The TreeNode: Intentionally Simple
Instead of implementing a full DOM, I went with a minimal tree structure:
data class TreeNode(
    val type: NodeType,
    val name: String? = null,   // Tag name
    var data: String? = null,   // Text content
    val attributes: MutableMap<String, String> = LinkedHashMap(), // Attributes
    val children: MutableList<TreeNode> = mutableListOf(),        // Child nodes
    var parent: TreeNode? = null                                  // Parent reference
)
No DOM methods like querySelector or addEventListener—just the pure tree structure. This keeps things focused on the parsing problem rather than getting lost in DOM API complexity.
Smart Details: Text Node Merging
One thing I did implement is automatic text node merging:
fun insertText(data: String) {
val lastChild = currentNode.lastChild
if (lastChild?.type == NodeType.TEXT) {
// Merge with previous text node
lastChild.data = (lastChild.data ?: "") + data
} else {
val textNode = TreeNode.createText(data)
currentNode.appendChild(textNode)
}
}
So parsing <p>Hello World</p> creates one text node, not two separate ones.
Current Status and the Big Question
I have the basic insertion modes stubbed out and the core tree building logic working. But I keep coming back to the same question: how far should I take this?
The WHATWG spec has hundreds of pages on DOM APIs. Do I implement Element.querySelector()? What about MutationObserver? Event handling? CSS selector matching?
For now, I’m resisting the urge to build a full browser engine. The tree structure I have is sufficient for parsing HTML and representing the document structure. If I need more functionality later, I can always build a DOM API layer on top of this foundation.
The tree construction phase is turning out to be much more straightforward than tokenization—at least so far.
Testing: How to Know You’re Not Completely Wrong
Here’s the thing about HTML parsing: you can’t just write some tests and call it a day. The edge cases are so numerous and bizarre that you need industrial-strength test coverage. There are two main test suites for this:
Web Platform Tests (WPT) [4] - The official W3C test suite used by all major browsers. Comprehensive but complex to integrate.
html5lib [5] - A more focused test suite specifically for HTML parsing with a simpler JSON format.
I went with html5lib for tokenizer tests since it’s much easier to understand and implement. The test format is straightforward:
{
    "input": "&AElig",
    "description": "Named entity: AElig without a semi-colon",
    "output": [
        [
            "Character",
            "\u00c6"
        ]
    ],
    "errors": [
        { "code": "missing-semicolon-after-character-reference", "line": 1, "col": 7 }
    ]
}
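Each case deserializes into a small data class. Here’s a sketch of the shape I’d use, assuming kotlinx.serialization (field names mirror the JSON keys; a few optional html5lib keys like initialStates are omitted):

import kotlinx.serialization.Serializable
import kotlinx.serialization.json.JsonElement

@Serializable
data class ParseError(val code: String, val line: Int, val col: Int)

@Serializable
data class TestCase(
    val description: String = "",
    val input: String,
    // Each entry mixes strings, objects, and booleans, e.g. ["Character", "\u00c6"],
    // so JsonElement keeps the parsing flexible.
    val output: List<List<JsonElement>>,
    val errors: List<ParseError> = emptyList()
)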
Building a Test Generator
Rather than manually writing 5,600+ test cases, I built a test generator that fetches the JSON from html5lib and converts it to actual Kotlin test cases:
private fun generateTest(testCase: TestCase, testName: String): String {
    return buildString {
        appendLine("    @Test")
        // Backticks let the generated function name keep the spaces from the
        // original test description.
        appendLine("    fun `$testName`() = assertTokenization(")
        appendLine("        input = \"${testCase.input.escapeString()}\",")
        appendLine("        expectedTokens = listOf(")
        appendLine(generateExpectedTokens(testCase.output))
        appendLine("        )")
        appendLine("    )")
    }
}
The generator preserves test case names, inputs, and expected outputs while converting them to idiomatic Kotlin tests. For example, a JSON test like:
{
"description": "Quoted attribute followed by permitted /",
"input": "<br a='b'/>",
"output": [
[
"StartTag",
"br",
{
"a": "b"
},
true
]
]
}
Becomes:
@Test
fun `Quoted attribute followed by permitted slash`() = assertTokenization(
input = "<br a='b'/>",
expectedTokens = listOf(
startTag(
name = "br",
attributes = mapOf(
"a" to "b",
),
selfClosing = true
),
)
)
This approach gave me confidence in the tokenizer implementation—having 5,600+ tests from the official html5lib suite means I can catch edge cases I never would have thought of. Plus, if there’s ever an issue with my test expectations, I can easily trace back to the original JSON test case to verify the correct behavior.
Someone, somewhere, tried to use &#x110000; in their HTML (Unicode only goes up to 0x10FFFF). The fact that this edge case has its own test tells you everything you need to know about the state of HTML on the internet.
Reflections
Building this parser taught me that the WHATWG spec isn’t academic theory—it’s web archaeology. Every bizarre rule exists because some website from 1999 would break without it. The 80+ tokenizer states that initially seemed overwhelming are actually elegant: each state knows exactly what it’s responsible for and nothing else. Kotlin’s expressiveness was perfect for this—I could translate spec prose almost word-for-word into working code.
The html5lib test suite was both humbling and educational. Choosing Kotlin Multiplatform turned out well too—HTML parsing is pure algorithms with no platform dependencies, so getting JVM, JavaScript, and native targets for free was a nice bonus.
Should you use this parser? Definitely not yet—it’s still a learning project that I haven’t made public. It currently passes all of the html5lib tokenization tests and handles basic tree construction, but I want to get it producing proper DOM trees before sharing the code. Either way, implementing it has been educational—I understand how browsers handle malformed HTML much better now, and I’m content with having built a solid foundation for whatever comes next.
1. WHATWG HTML Living Standard - Parsing - The definitive specification for HTML parsing that all modern browsers implement
2. Character Reference State - The WHATWG HTML specification section defining how to handle character references (entities) during tokenization
3. Maximal munch - A term from compiler theory meaning “always consume the longest valid match” when multiple parsing options are available
4. Web Platform Tests - A cross-browser test suite for web platform specifications, ensuring consistent implementation across browsers
5. html5lib Test Suite - Comprehensive test cases covering every edge case in HTML parsing, used by major browser implementations