Possibly the most fascinating HTML parser behavior ever

I learned about this tidbit from sirdarckcat. It is in no way new, but the trick is so cute that I just could not resist sharing.


When parsing HTML documents, browsers recognize two methods of specifying tag parameter values: a "bare" form (such as <img src=image.jpg>), which is terminated by a closing angle bracket, whitespace, and so on; and a quoted form (<img src="image.jpg">), which is terminated only by a matching quote.


Every browser makes the decision by looking at the first non-whitespace character after the name=value separator. If this happens to be a single or a double quotation mark, the second parsing strategy is used; otherwise, the first method is a go. Internet Explorer also recognizes backticks (`) as a faux quote, leading to security flaws in a fair number of HTML filters - but even with this quirk, the behavior is still pretty straightforward. In particular, in the following example, stray quotes will not have any effect on how the tag is interpreted:


<a href=http://www.example.com/?">This text is not a tag parameter anymore.">Click me</a>
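For the code-minded, here is a minimal Python sketch of the conventional rule described above. It is a toy model, not any browser's actual tokenizer: the function name read_attr_value and its signature are made up for illustration, and it deliberately ignores the backtick quirk and other real-world details.

```python
def read_attr_value(markup, eq_pos):
    """Toy model: return (value, index just past the value) for the
    attribute whose '=' separator sits at markup[eq_pos]."""
    i = eq_pos + 1
    # Skip whitespace between the separator and the value.
    while i < len(markup) and markup[i].isspace():
        i += 1
    # The first non-whitespace character picks the parsing strategy.
    if i < len(markup) and markup[i] in "\"'":
        quote = markup[i]
        end = markup.find(quote, i + 1)
        if end == -1:
            # Unterminated quote: consume the rest of the input.
            return markup[i + 1:], len(markup)
        return markup[i + 1:end], end + 1
    # Bare form: stop at whitespace or a closing angle bracket.
    end = i
    while end < len(markup) and not markup[end].isspace() and markup[end] != ">":
        end += 1
    return markup[i:end], end


tag = '<a href=http://www.example.com/?">This text is not a tag parameter anymore.">Click me</a>'
value, end = read_attr_value(tag, tag.index("="))
print(value)     # the stray quote is just another character of the bare value
print(tag[end])  # the value stops at the first closing angle bracket
```

Under this model, the stray quote ends up inside the bare value, and the tag closes at the first angle bracket.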


But here's the thing: Internet Explorer seems to be doing a substring search for an equals sign followed by a quote anywhere in the parameter name=value pair. Therefore, the following syntax will be parsed in a very different way:


<a href=http://www.example.com/?=">This is still a part of markup indeed!">Click me</a>
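To make the difference concrete, here is a rough Python sketch that finds where a tag ends under both interpretations. It models only the behavior as described above (an equals sign followed by a quote inside a bare value starts a quoted run); the function find_tag_end and its ie_quirk flag are hypothetical names, and the real Internet Explorer logic is almost certainly more involved.

```python
def find_tag_end(markup, ie_quirk=False):
    """Toy model: return the index of the '>' that closes the tag
    starting at markup[0], or -1 if the tag never closes."""
    n = len(markup)
    i = 1
    while i < n:
        ch = markup[i]
        if ch == ">":
            return i
        if ch == "=":
            j = i + 1
            # Skip whitespace after the name=value separator.
            while j < n and markup[j].isspace():
                j += 1
            if j < n and markup[j] in "\"'":
                # Quoted form: jump to the matching quote.
                close = markup.find(markup[j], j + 1)
                if close == -1:
                    return -1
                i = close + 1
                continue
            # Bare form: runs until whitespace or '>'.
            while j < n and not markup[j].isspace() and markup[j] != ">":
                # Assumed quirk: '=' followed by a quote anywhere in the
                # bare value switches to quoted parsing.
                if ie_quirk and markup[j] == "=" and j + 1 < n and markup[j + 1] in "\"'":
                    close = markup.find(markup[j + 1], j + 2)
                    if close == -1:
                        return -1
                    j = close
                j += 1
            i = j
            continue
        i += 1
    return -1


quirky = '<a href=http://www.example.com/?=">This is still a part of markup indeed!">Click me</a>'
for flag in (False, True):
    end = find_tag_end(quirky, ie_quirk=flag)
    print(flag, repr(quirky[end + 1:]))
```

With the flag off, everything after the first angle bracket is treated as text; with it on, only "Click me</a>" is left outside the tag. A filter that follows the first model while the browser follows the second is exactly the kind of desynchronization at stake here.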


It's one of the most unusual and surreal HTML parser quirks I am aware of (and it survives to this day in Internet Explorer 9). In principle, it allows any server-side HTML filter to get out of sync with the browser, leading to parameter splitting and tag consumption. In reality, it has limited practical significance: if your HTML filter is relaxed enough to allow this syntax to go through, it is probably already vulnerable to the abuse of other syntax tricks.

The dreaded curse of openness

Several weeks ago, the chairman of Trend Micro had this to say:


"Android is open-source, which means the hacker can also understand the underlying architecture and source code. We have to give credit to Apple, because they are very careful about it. It's impossible for certain types of viruses to operate on the iPhone."


Now that Kaspersky has, ahem, joined the open source crowd - I worry that hackers may soon be able to understand the operation of anti-virus software as well. And beyond that unthinkable point, only darkness looms.