Recently, I got curious about HTML parsing—you know, that thing browsers do billions of times a day that we all take completely for granted.
How hard could it be? I thought, like every programmer before me who has wandered into this particular circle of hell.
Turns out: very hard. HTML parsing isn’t just about recognizing <div> tags and calling it a day. It’s a tangle of state machines, error recovery, and countless bizarre edge cases.
The good news? HTML parsing is a solved problem. It’s been thoroughly documented in the WHATWG and W3C specifications. I went with the WHATWG HTML Living Standard [1] because it’s what all modern browsers actually implement, and it’s actively maintained. The WHATWG spec defines a parsing algorithm so intricate that implementing it correctly is...