Full Report
Many byte encoding standards exist, such as UTF-8, UTF-16 and Big5, and each one defines how bytes map to characters. The server encodes text to bytes with one charset and the browser decodes those bytes with another; most of the time the two are the same. This article describes a crazy bug in modern browsers, triggered when the two differ, that can lead to XSS and other issues.

In HTTP, the charset can be declared through the charset attribute of the Content-Type header, a meta tag within the HTML, or a byte order mark (U+FEFF) at the start of the document. If the browser can't determine the charset from any of these, it falls back to auto-detection. The idea of the attack is to get data that was produced under one charset interpreted under another when the browser receives it. Back in the day, this was used to get XSS on Google via UTF-7. Modern browsers have banned UTF-7, though, and most charsets aren't helpful for smuggling in characters, except one.

ISO-2022-JP is a Japanese character encoding that browsers must support. It is stateful: escape sequences switch the decoder between character sets. For example, the byte sequence 0x1b 0x28 0x42 (ESC ( B) tells the decoder to treat the following bytes as ASCII, while 0x1b 0x28 0x4a (ESC ( J) selects JIS X 0201 Roman. What's even better is that Chrome and Firefox will both auto-detect this encoding for us to cause havoc.

The first attack negates backslash escaping. It requires input in two places in the file: one near the top to carry the escape sequence, and one further down inside a string where the server backslash-escapes double quotes to prevent XSS. Once the escape sequence is added, the browser switches from ASCII (the default mode) to JIS X 0201 Roman. In that character set almost every byte decodes as it does in ASCII, except two: 0x5C and 0x7E. 0x5C is the yen sign instead of the backslash, and 0x7E is the overline instead of the tilde. So, instead of seeing a backslash in front of the double quote, the browser sees a yen sign, and the quote closes the string. Now we can execute arbitrary JS because we escaped the string. That's pretty neat!
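To make the backslash trick concrete, here is a minimal sketch of a decoder that handles only two of the ISO-2022-JP modes: ASCII (selected by ESC ( B) and JIS X 0201 Roman (selected by ESC ( J). The function name, the page layout and the payload are hypothetical and chosen for illustration; a real browser decoder also handles the two-byte Japanese modes, which are left out here.

```python
def decode_single_byte_modes(data: bytes) -> str:
    """Decode bytes using only two ISO-2022-JP modes (illustrative sketch).

    Handles ESC ( B (ASCII) and ESC ( J (JIS X 0201 Roman). The two-byte
    Japanese modes are deliberately not implemented.
    """
    out = []
    roman = False  # the decoder starts in ASCII mode
    i = 0
    while i < len(data):
        chunk = data[i:i + 3]
        if chunk == b"\x1b(B":        # switch to ASCII
            roman = False
            i += 3
        elif chunk == b"\x1b(J":      # switch to JIS X 0201 Roman
            roman = True
            i += 3
        else:
            byte = data[i]
            if roman and byte == 0x5C:
                out.append("\u00a5")  # yen sign instead of backslash
            elif roman and byte == 0x7E:
                out.append("\u203e")  # overline instead of tilde
            else:
                out.append(chr(byte))  # all other bytes pass through unchanged
            i += 1
    return "".join(out)


# Hypothetical page with two reflected inputs (assumed layout).
first_input = b"\x1b(J"              # escape sequence near the top
second_input = b'\\"; alert(1)//'    # the server already escaped the quote
page = (b"<p>" + first_input + b'</p><script>var s = "'
        + second_input + b'";</script>')

decoded = decode_single_byte_modes(page)
print(decoded)
# <p></p><script>var s = "¥"; alert(1)//";</script>
```

The server believed it sent `\"`, but after decoding there is no backslash left in the page, so the quote ends the string and the rest of the input runs as script.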
The second technique applies when data is controlled in two separate locations in the HTML, such as an attribute and a plaintext section. The idea is to switch from ASCII to a two-byte Japanese character set inside the attribute and then, within the plaintext, switch back to ASCII. Since the closing double quote of the attribute was swallowed by the charset change, the plaintext data is now INSIDE the attribute. The attribute then ends at the next double quote in the page, so whatever follows that quote, typically the content of a later attribute, is treated as HTML when it shouldn't be. Pretty neat!

How do we get the browser to use this encoding? According to the authors, direct control over the charset, or control via a meta tag, is the easy case. If no charset is declared, they say auto-detection picks ISO-2022-JP very easily. I'm guessing that it simply looks for the escape sequences.

Overall, a great post on the complications of character encodings in the browser. The browser's attempt to help the web page has once again added a security mishap to the world.
Analysis Summary
# Vulnerability: Charset Differential XSS via ISO-2022-JP Auto-detection
## CVE Details
- **CVE ID**: N/A (Design flaw in browser auto-detection and encoding standards)
- **CVSS Score**: Estimated 7.5 (High) - based on XSS impact via network vector.
- **CWE**: CWE-116 (Improper Encoding or Escaping of Output), CWE-172 (Encoding Errors)
## Affected Systems
- **Products**: Modern Web Browsers (specifically Google Chrome and Mozilla Firefox).
- **Versions**: Current versions as of mid-2024.
- **Configurations**: Web applications that serve HTML content without an explicit `charset` attribute in the `Content-Type` header or a `<meta>` tag, relying on browser auto-detection.
## Vulnerability Description
Modern browsers use "encoding sniffing" algorithms to guess the character set of a page if it isn't explicitly defined. The researchers found that browsers will auto-detect **ISO-2022-JP**, a stateful Japanese character encoding. It uses escape sequences to switch between character sets: for example, `0x1b 0x28 0x42` selects ASCII, `0x1b 0x28 0x4a` selects JIS X 0201 Roman, and `0x1b 0x24 0x42` selects a two-byte Japanese character set.
Two primary attack vectors exist:
1. **Backslash Negation**: In ISO-2022-JP's JIS X 0201 Roman mode (selected with `0x1b 0x28 0x4a`), the byte `0x5C` is interpreted as a **Yen symbol (¥)** instead of a **Backslash (\)**. If a server performs backslash-escaping to prevent XSS (e.g., `\"`), an attacker can inject the escape sequence earlier in the page so that the browser sees a Yen sign followed by a literal quote, effectively "breaking out" of the string.
2. **Attribute Smuggling**: By switching to a 2-byte-per-character mode within an attribute and switching back in a later plaintext section, an attacker can cause the browser to "consume" the closing quote of an attribute. This shifts the subsequent HTML (like a `src` attribute) into the attribute value and allows the attacker to inject new HTML attributes like `onerror`.
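The quote-consuming effect of the second vector can be sketched as follows. This is a simplified model, not a real decoder: in two-byte mode it replaces every byte pair with a placeholder character instead of looking it up in the JIS X 0208 table, and the markup, the inputs and the function name are hypothetical.

```python
PLACEHOLDER = "\u25a1"  # stands in for whatever a two-byte pair decodes to


def decode_with_two_byte_mode(data: bytes) -> str:
    """Model how ISO-2022-JP's two-byte mode swallows markup (sketch only)."""
    out = []
    two_byte = False  # the decoder starts in ASCII mode
    i = 0
    while i < len(data):
        chunk = data[i:i + 3]
        if chunk in (b"\x1b$@", b"\x1b$B"):   # switch to a two-byte set
            two_byte = True
            i += 3
        elif chunk == b"\x1b(B":              # switch back to ASCII
            two_byte = False
            i += 3
        elif two_byte:
            # Two bytes form one character. If an escape byte follows
            # directly, only the single stray byte is consumed.
            stray = i + 1 >= len(data) or data[i + 1] == 0x1B
            out.append(PLACEHOLDER)
            i += 1 if stray else 2
        else:
            out.append(chr(data[i]))
            i += 1
    return "".join(out)


# Hypothetical markup: attacker text in an alt attribute and in a paragraph.
attribute_input = b"\x1b$@"          # enter two-byte mode inside alt="..."
plaintext_input = b"\x1b(B hello"    # return to ASCII in the plain text
page = (b'<img alt="' + attribute_input + b'" src="a.png"><p>'
        + plaintext_input + b"</p>")

decoded = decode_with_two_byte_mode(page)
print(decoded)
```

In the decoded text the opening quote of `alt` is the only double quote left: the closing quote and the `src` attribute were consumed as two-byte characters. The attribute value therefore runs on until the next double quote that appears later in the page.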
## Exploitation
- **Status**: PoC available/Research published.
- **Complexity**: Medium (Requires specific injection points in two locations or before/after server-side escaping).
- **Attack Vector**: Network (Remote).
## Impact
- **Confidentiality**: High (Can steal session cookies/sensitive data via XSS).
- **Integrity**: High (Can modify page content or perform actions on behalf of the user).
- **Availability**: None.
## Remediation
### Patches
- There is no specific "patch" as this is a behavior of the HTML specification's encoding sniffing. However, researchers have proposed that browsers should disable auto-detection for ISO-2022-JP.
### Workarounds
- **Strict Charset Declaration**: Always include an explicit charset in the HTTP header: `Content-Type: text/html; charset=utf-8`.
- **Meta Tags**: Use `<meta charset="utf-8">` as the very first element in the `<head>`, so that it falls within the first 1024 bytes of the document.
- **Content Security Policy (CSP)**: Implement a strong CSP to prevent the execution of inline scripts or unauthorized external scripts.
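As a minimal sketch of the first workaround, the helper below adds a `charset` parameter to a `Content-Type` value when a text response lacks one. The function name and the choice to touch only `text/*` types are assumptions made for illustration; in practice this would be set through the web framework or server configuration.

```python
def with_explicit_charset(content_type: str, charset: str = "utf-8") -> str:
    """Return the Content-Type value with a charset parameter added if missing.

    Hypothetical helper: only text/* media types are modified, and an
    existing charset parameter is left untouched.
    """
    parts = [p.strip() for p in content_type.split(";") if p.strip()]
    if not parts or not parts[0].lower().startswith("text/"):
        return content_type
    if any(p.lower().startswith("charset=") for p in parts[1:]):
        return content_type
    return "; ".join(parts + [f"charset={charset}"])


print(with_explicit_charset("text/html"))
# text/html; charset=utf-8
```

With the charset declared in the header, the browser has no reason to fall back to auto-detection for that response.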
## Detection
- **Indicators of Compromise**: Presence of the ISO-2022-JP escape sequence bytes (`0x1b 0x28 0x42`, `0x1b 0x24 0x42`, etc.) in user-supplied input or HTTP logs.
- **Detection Methods**: Security scanners should flag any HTML responses that declare no charset in either the `Content-Type` header or a `<meta>` tag. WAFs can be configured to block ISO-2022-JP escape sequences in form data/URL parameters.
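A scan for the indicator bytes could look like the sketch below. The pattern covers the escape sequences mentioned in this report plus the Katakana and older two-byte variants that, to my understanding, the browser decoder also recognises; treat the exact list and the function name as assumptions. URL parameters are percent-decoded first, since the escape byte usually arrives as `%1B`.

```python
import re
from typing import List, Tuple
from urllib.parse import unquote_to_bytes

# ESC ( B, ESC ( J, ESC ( I, ESC $ @ and ESC $ B (assumed list)
ISO_2022_JP_ESCAPES = re.compile(rb"\x1b(?:\([BJI]|\$[@B])")


def find_escape_sequences(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (offset, sequence) for every ISO-2022-JP escape found in data."""
    return [(m.start(), m.group()) for m in ISO_2022_JP_ESCAPES.finditer(data)]


# Hypothetical query string taken from an HTTP log entry.
raw_query = "q=%1B%28J&page=2"
hits = find_escape_sequences(unquote_to_bytes(raw_query))
print(hits)
# [(2, b'\x1b(J')]
```

A hit is only an indicator: legitimate Japanese text sent in ISO-2022-JP contains the same sequences, so matches need review in context.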
## References
- **Original Research**: [https://www.sonarsource.com/blog/encoding-differentials-why-charset-matters/](https://www.sonarsource.com/blog/encoding-differentials-why-charset-matters/)
- **Conference Presentation**: [https://www.youtube.com/watch?v=z-ug2dwcSz8](https://www.youtube.com/watch?v=z-ug2dwcSz8)
- **HTML Specification**: [https://html.spec.whatwg.org/#character-encodings](https://html.spec.whatwg.org/#character-encodings)