Full Report
Windows supports Unicode strings today, but it did not always. This article discusses the evolution of string encodings on Windows and the backward-compatibility requirement that came with it. Originally, Windows used "ANSI" encodings, which rely on code pages to represent characters that do not fit within 7-bit ASCII. Each code page is specific to a language or region, so a message written on a Taiwanese computer would render differently on a Japanese one. Windows actually has two kinds of code pages: ANSI code pages (the focus of this article) and OEM code pages.

Eventually, Windows moved to UTF-16, which uses 16 bits for most characters and 32 bits (a surrogate pair) for rarer ones. As part of this change, many Windows APIs switched to wide (UTF-16) characters. To remain backward compatible, there are two sets of APIs: A for ANSI and W for wide UTF-16.

The main focus of this post is what happens when Unicode characters that don't exist in the current code page end up going through the ANSI APIs. Instead of erroring out, Windows attempts a "best fit" match to the current ANSI code page. For instance, the infinity character (∞) gets mapped to 8 on code page 1252. Different code pages have different quirks. To explore these mappings, the researchers created a tool that shows them. The goal is to abuse this "best fit" feature to trick programs on Windows into doing weird things.

The first instance they found was in PHP-CGI. The original vulnerability (from 2012) demonstrated that adding a dash (-) to a query parameter could be used for argument injection, eventually leading to code execution. Combining the same exploit method with the "best fit" trick makes this possible again: in the URL query string ?%ADs, the %AD (a soft hyphen) is translated into a - on Chinese and Japanese systems. I remember reading that report but had no idea why this mapping happened; I investigated at the time but never reached a good conclusion. Now I know! What else is affected by this?
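To make the mapping concrete, here is a minimal Python sketch that simulates best-fit conversion with a tiny hand-written table. The table holds only the two mappings mentioned above (∞ to 8 on code page 1252, and the soft hyphen to - on a Japanese code page, assumed here to be 932). The names `BEST_FIT`, `best_fit`, and `naive_is_option` are illustrative and not part of any Windows or PHP API; on a real system the conversion is performed by Windows itself, not by a lookup like this.

```python
from urllib.parse import unquote

# Illustrative subset of best-fit mappings, keyed by code page.
# Only the examples from the text are included; real tables are much larger.
BEST_FIT = {
    1252: {"\u221e": "8"},  # infinity sign -> digit 8
    932: {"\u00ad": "-"},   # soft hyphen -> hyphen-minus (code page number assumed)
}


def best_fit(text: str, code_page: int) -> str:
    """Simulate the lossy Unicode-to-ANSI conversion for one code page."""
    table = BEST_FIT.get(code_page, {})
    return "".join(table.get(ch, ch) for ch in text)


def naive_is_option(query: str) -> bool:
    """Hypothetical filter: treat a query string starting with '-' as an option."""
    return query.startswith("-")


# The byte 0xAD, read as a single character, is U+00AD (soft hyphen).
query = unquote("%ADs", encoding="latin-1")

print(naive_is_option(query))                 # the filter sees a soft hyphen
print(naive_is_option(best_fit(query, 932)))  # after best-fit the string is "-s"
print(best_fit("\u221e", 1252))               # the infinity example
```

The filter runs on the original string, while the program that later consumes the converted string sees a leading dash. That gap between the two views is the whole trick.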
The Yen (¥) and Won (₩) signs on the Japanese and Korean code pages both best-fit to \, and other characters best-fit to / on some code pages. Since these are path separators, they are interesting for directory traversal. The researchers found that the Cuckoo Sandbox could be escaped using this technique: its checks saw a string containing only safe characters, but the Windows file access APIs performed the "best fit" mapping under the hood.

The next target is command-line arguments, similar to the PHP bug. In PHP, escapeshellarg() is the standard way to prevent command injection and argument injection. In Python, subprocess escapes the arguments and then, under the hood, calls CreateProcess with the quoted parameters. If you control ANY part of the data in the command, U+FF02 (a fullwidth quotation mark) can be used to bypass the escaping: the escaping functions leave it alone, but the best-fit mapping turns it into a " before the target executable sees its arguments. The same attack works by injecting a \ to cancel the escaping of another character. For instance, if a Won sign is placed directly before a ", the escaping function produces ₩\". Once the best-fit mapping converts the Won sign, this becomes \\", where the two backslashes pair up and the quote is left unescaped. They mention that argument splitting via spaces and tabs is also possible using other characters. Neat!

ElFinder, a PHP application, could be used to pop a shell through argument injection into the tar.exe command. The Open-With feature in Windows uses a handler table, and since the filename becomes part of the command-line arguments, it is an attack surface. With Microsoft Excel, renaming a file to an argument-splitting payload confuses how the command line is interpreted, which allows arbitrary arguments to be added to excel.

You're not safe even if your program is just a simple C program! Using int main defaults to the ANSI APIs for fetching the arguments and environment variables. The compiler adds this under the hood.
A developer can use wmain instead to remediate this. Environment variables were a major issue as well, leading to LFI and a WAF bypass in some PHP applications.

Disclosure of these bugs was difficult. Developers thought the bug was in Windows, while Microsoft said it needed to maintain backward compatibility. As a user, you can enable the beta UTF-8 option on Windows. Additionally, use safe APIs instead of shell commands when possible. Is this the dawn of a new bug class on Windows? It appears so.
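The argument-splitting bypass described in this report can be sketched in Python. This is a simulation under stated assumptions, not the researchers' tooling: `BEST_FIT_1252` is a one-entry stand-in for the real code page 1252 table, `prog.exe` and `--evil` are made-up names, and `shlex.split` stands in for the child process's argument parser (its POSIX rules agree with the Windows rules for this simple input, but not in general). `subprocess.list2cmdline` is the quoting helper Python uses when building a Windows command line.

```python
import shlex
import subprocess

# Illustrative best-fit entry (code page 1252 assumed):
# fullwidth quotation mark -> ASCII double quote.
BEST_FIT_1252 = {"\uff02": '"'}


def best_fit(text: str) -> str:
    """Simulate the ANSI conversion the child process would observe."""
    return "".join(BEST_FIT_1252.get(ch, ch) for ch in text)


# Attacker-controlled value that is supposed to stay a single argument.
user_value = "a\uff02 \uff02--evil"

# The parent quotes the value; the fullwidth quotes are left untouched.
command_line = subprocess.list2cmdline(["prog.exe", user_value])

# Parent's view: program name plus one quoted argument.
before = shlex.split(command_line)

# Child's view after conversion: the fullwidth quotes are now real quotes.
after = shlex.split(best_fit(command_line))

print(before)
print(after)
```

In this sketch `before` holds two items and `after` holds three, with `--evil` promoted to an argument of its own.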
Analysis Summary
# Tool/Technique: WorstFit (Windows ANSI Best-Fit Exploitation)
## Overview
WorstFit is a technique that exploits the "Best-Fit" character mapping behavior in Windows' legacy ANSI APIs. When Unicode characters that do not exist in a specific legacy code page are passed to an ANSI API, Windows attempts to map them to the "closest" available character instead of failing. Attackers can use specific Unicode characters (transformers) that map to sensitive delimiters like quotes (`"`), backslashes (`\`), or dashes (`-`) to bypass security filters and achieve Path Traversal or Remote Code Execution (RCE).
## Technical Details
- **Type**: Technique (Exploitation of OS-level character encoding conversion)
- **Platform**: Windows (specifically versions using legacy ANSI code pages/APIs)
- **Capabilities**: Bypassing string sanitization, Argument Injection, Path Traversal, Environment Variable Confusion, and Sandbox Escape.
- **First Seen**: Research published at Black Hat Europe 2024; historically linked to a 2012 PHP-CGI vulnerability (CVE-2012-1823).
## MITRE ATT&CK Mapping
- **[TA0001 - Initial Access]**
  - **[T1190 - Exploit Public-Facing Application]** (e.g., PHP-CGI, ElFinder)
- **[TA0005 - Defense Evasion]**
  - **[T1036 - Masquerading]** (Using look-alike Unicode characters to represent control characters)
  - **[T1211 - Exploitation for Defense Evasion]**
- **[TA0002 - Execution]**
  - **[T1059 - Command and Scripting Interpreter]**
  - **[T1203 - Exploitation for Client Execution]** (e.g., 1-Click Excel exploits)
## Functionality
### Core Capabilities
- **Filename Smuggling**: Using characters like the Fullwidth Full Stop (`．` U+FF0E) or currency symbols (`¥`, `₩`) to bypass path filters. These map to `.` and `\` or `/`, enabling path traversal even if the application filters standard ASCII slash/dot characters.
- **Argument Splitting**: Utilizing characters like the Fullwidth Quotation Mark (`＂` U+FF02), which maps to a standard double quote (`"`). This allows an attacker to "break out" of quoted command-line arguments.
- **Environment Variable Confusion**: Exploiting `GetEnvironmentVariableA` or `getenv` to retrieve "Best-Fit" versions of variables, potentially bypassing WAFs or security checks that only inspect standard ASCII inputs.
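As a rough illustration of the filename smuggling case, the following Python sketch runs a separator filter on the original name and then resolves the path as it would look after conversion. The one-entry table (Yen sign to backslash on code page 932), the `looks_safe` filter, and the `C:\sandbox\files` base directory are assumptions made for the example; `ntpath` is used only so that Windows path rules apply on any platform.

```python
import ntpath

# Illustrative best-fit entry (code page 932 assumed): yen sign -> backslash.
BEST_FIT_932 = {"\u00a5": "\\"}


def best_fit(text: str) -> str:
    """Simulate the conversion an ANSI file API would apply to the path."""
    return "".join(BEST_FIT_932.get(ch, ch) for ch in text)


def looks_safe(name: str) -> bool:
    """Hypothetical filter: reject names containing ASCII path separators."""
    return "/" not in name and "\\" not in name


base = "C:\\sandbox\\files"
name = "..\u00a5..\u00a5secret.txt"

passes_filter = looks_safe(name)

# The path that is effectively opened once the yen signs become backslashes.
resolved = ntpath.normpath(best_fit(ntpath.join(base, name)))

print(passes_filter)
print(resolved)
```

The name passes the filter because it contains no ASCII separator, yet the resolved path ends up outside the base directory.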
### Advanced Features
- **Implicit ANSI Entry Points**: Exploits the fact that `int main()` in C/C++ often defaults to ANSI (via `GetCommandLineA`), making even simple programs vulnerable if they don't use `wmain`.
- **Cross-Code Page Variability**: The "Best-Fit" mapping changes based on the system's locale (e.g., Code Page 932 for Japanese, 949 for Korean), making the exploit payload dependent on the target's regional settings.
## Indicators of Compromise
- **File Names**: Files containing non-ASCII Unicode characters with known best-fit mappings, such as `\u00A5` (Yen), `\u20A9` (Won), or `\uFF02` (Fullwidth Quote).
- **Behavioral Indicators**:
  - Subprocess execution where command-line arguments contain unexpected Unicode characters that transform into shell metacharacters.
  - Web requests with URL-encoded Unicode (e.g., `%AD` for soft hyphen mapping to `-`).
  - Unexpected directory traversal patterns originating from inputs that passed initial "safe character" validation.
## Associated Threat Actors
- No specific threat actors are named. The technique was newly documented as a general class by researchers **Orange Tsai** and **splitline**; the related PHP-CGI bug (CVE-2024-4577) revives the argument injection originally reported as CVE-2012-1823.
## Detection Methods
- **Behavioral Detection**: Monitor for processes spawning shells or sensitive binaries (like `tar.exe`, `Excel.exe`, `cmd.exe`) where the command line contains non-ASCII Unicode characters.
- **Input Validation Audit**: Detect inputs containing "Fullwidth" Unicode variants (U+FF00 to U+FFEF) in web or application logs.
- **Detection Logic**: Verify if the application is using ANSI versions of Windows APIs (ending in 'A') while processing user-supplied Unicode data.
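A log or command-line scan along these lines could look like the Python sketch below. It flags only the characters named in this summary plus the U+FF00 to U+FFEF range; `KNOWN_TRANSFORMERS` and `find_suspicious` are hypothetical names, and the character list is deliberately incomplete, so treat it as a starting point rather than a complete detector.

```python
# Characters named in this summary that have interesting best-fit targets.
KNOWN_TRANSFORMERS = {
    "\u00a5": "YEN SIGN",
    "\u20a9": "WON SIGN",
    "\u00ad": "SOFT HYPHEN",
}

# Halfwidth and Fullwidth Forms block, U+FF00 through U+FFEF.
FULLWIDTH_RANGE = range(0xFF00, 0xFFF0)


def find_suspicious(text: str) -> list[tuple[int, str]]:
    """Return (index, code point) pairs for characters worth a closer look."""
    hits = []
    for index, ch in enumerate(text):
        if ch in KNOWN_TRANSFORMERS or ord(ch) in FULLWIDTH_RANGE:
            hits.append((index, f"U+{ord(ch):04X}"))
    return hits


sample = 'tar.exe -tf "a\uff02 \uff02--evil"'
print(find_suspicious(sample))
print(find_suspicious("tar.exe -tf archive.tar"))
```

Legitimate CJK file names also contain fullwidth characters, so hits from a scan like this need context before they are treated as alerts.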
## Mitigation Strategies
- **Developer Level**:
  - Use **Wide Character APIs** (e.g., `CreateFileW`, `_wgetenv`, `wmain`) exclusively.
  - Avoid calling shell commands via `system()` or `popen()`.
- **System Level**:
  - Enable the Windows Beta feature: **"Use Unicode UTF-8 for worldwide language support"** in Region settings.
  - Keep software (like PHP, Python, OpenSSL) updated to versions that have addressed specific Best-Fit vulnerabilities.
- **Hardening**: Phase out the use of legacy ANSI code pages in favor of UTF-16 internally and UTF-8 externally.
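Where an ANSI code path cannot be removed right away, one complementary check (not something the source prescribes) is to reject input that cannot be represented exactly in the target code page, so there is nothing left for best-fit to rewrite. The Python sketch below relies on Python's codecs being strict by default; `survives_ansi_conversion` is a hypothetical helper, and the assumption is that Python's `cp1252` table is close enough to the one Windows uses.

```python
def survives_ansi_conversion(text: str, encoding: str = "cp1252") -> bool:
    """Return True if every character is exactly representable in the code page."""
    try:
        # Python raises instead of substituting a look-alike character.
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(survives_ansi_conversion("report.txt"))        # plain ASCII
print(survives_ansi_conversion("caf\u00e9.txt"))     # e-acute exists in cp1252
print(survives_ansi_conversion("a\uff02 \uff02-x"))  # fullwidth quotes do not
```

This narrows the gap but does not replace the wide-character APIs listed above, since the active code page differs between machines.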
## Related Tools/Techniques
- **CVE-2024-4577**: PHP-CGI Argument Injection (2024)
- **CVE-2012-1823**: Original PHP-CGI vulnerability.
- **Path Traversal**: General technique enhanced by WorstFit.
- **Argument Injection**: General technique enhanced by WorstFit.