Full Report
This is an update on a previous post on foreign NT hashes, where I got things a little wrong by believing the source encoding matters for an NT hash. It doesn't really; let me show you why. I spent a bit of time exploring further and, in particular, took it down to a test case. Jameel gave me his name as a password in Arabic. (It was included as a picture because WordPress is messing with my UTF-8; `echo d8acd985d98ad9842031|xxd -ps -r` can give it to you straight.) That's Jameel1 in Arabic. It's encoded in UTF-8 in most places, whose bytes are `d8acd985d98ad9842031`.
Analysis Summary
# Research: More On Foreign NT Hashes
## Metadata
- Authors: Dominic White
- Institution: SensePost
- Publication: SensePost Blog
- Date: October 8, 2020
## Abstract
This research serves as a correction and technical update following a previous investigation into NT hashes derived from non-ASCII (foreign language) passwords. The core objective was to determine whether the source character encoding (e.g., UTF-8, CP1256) influences the resulting NT hash, which is calculated by applying MD4 to the password encoded in UTF-16 Little Endian (LE). The findings confirm that the source encoding's byte representation does not affect the MD4 input, provided the password is correctly converted to UTF-16 LE. They also reveal technical nuances in how cracking utilities such as `hashcat` handle that pre-hashing encoding step, leading to specific recommendations for cracking passwords that contain non-Latin characters.
## Research Objective
The primary objective was to investigate and definitively clarify whether the initial encoding (e.g., UTF-8 or Windows Code Page 1256) of a password impacts the final NT hash value, correcting a prior misconception held by the author. Furthermore, the research aimed to establish viable methodologies for cracking these hashes, especially when standard tools fail.
## Methodology
### Approach
The methodology involved comparative testing using a specific Arabic password ("Jameel1" in Arabic script) across different encoding schemes (UTF-8, CP1256) to generate the base NT hash using a controlled environment (Windows 10 VM). Subsequently, the author attempted to crack the resulting hash using standard password cracking tools (`hashcat` and `John the Ripper`) to observe discrepancies in cracking success. Finally, the internal logic of the cracking tools (specifically analyzing `hashcat` source code) was examined to explain observed behavior.
### Dataset/Environment
- **Test Data:** The password "Jameel1" rendered in Arabic script, represented in both UTF-8 (`d8acd985d98ad9842031`) and Windows Code Page 1256 (`cc e3 ed e1 20 31`) byte sequences.
- **Environment:** Windows 10 Virtual Machine for hash generation; Python tools for verification; standard cracking environments for testing.
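
The relationship between the two byte sequences listed under Test Data can be checked with a few lines of Python. This is a minimal sketch that uses only the hex values quoted above and Python's built-in `utf-8`, `cp1256`, and `utf-16-le` codecs; the variable names are illustrative.

```python
# Byte sequences quoted in the Test Data entry above.
utf8_bytes = bytes.fromhex("d8acd985d98ad9842031")
cp1256_bytes = bytes.fromhex("cc e3 ed e1 20 31")

# Decode each sequence with its own codec to get a Unicode string.
from_utf8 = utf8_bytes.decode("utf-8")
from_cp1256 = cp1256_bytes.decode("cp1256")

# Re-encode as UTF-16 LE, the form that is fed to MD4 for an NT hash.
utf16le_from_utf8 = from_utf8.encode("utf-16-le")
utf16le_from_cp1256 = from_cp1256.encode("utf-16-le")

print("same text:", from_utf8 == from_cp1256)
print("UTF-16 LE from UTF-8 :", utf16le_from_utf8.hex())
print("UTF-16 LE from CP1256:", utf16le_from_cp1256.hex())
```

Both sources decode to the same Unicode string, so the UTF-16 LE bytes that reach MD4 are identical whichever encoding the password started in.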
### Tools & Technologies
- **Hash Generation:** Windows 10 VM, Python (`hashlib`, `binascii`).
- **Password Cracking:** `hashcat` (Mode 1000 for NT Hash) and `John the Ripper`.
- **Analysis:** `xxd` for byte manipulation/display, examination of `hashcat` source code (`module_01000.c`).
## Key Findings
### Primary Results
1. **Source Encoding Is Irrelevant After UTF-16 LE Conversion:** The final NT hash is the MD4 of the password *after* it has been converted to **UTF-16 LE**. The initial source encoding (UTF-8 vs. CP1256) is therefore irrelevant, as long as the conversion correctly yields the same UTF-16 LE byte sequence.
2. **Inconsistent Cracking Tool Behavior:** Standard `hashcat` (Mode 1000, NT Hash) failed to crack the hash generated from the Arabic password, whereas `John the Ripper` succeeded.
3. **Hashcat's Encoding Assumption:** Analysis of the `hashcat` source code revealed that its NT hash module applies its own UTF-16 LE conversion to the supplied cleartext input. For the specific Arabic password tested, however, that internal conversion produced bytes whose MD4 did not match the hash generated directly on Windows.
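
To illustrate how a UTF-16 LE conversion step can silently go wrong for non-ASCII input, the sketch below compares a correct conversion with a naive one that simply appends a zero byte to every input byte. The naive routine is a hypothetical example of a failure mode, not a statement of what `hashcat` actually does internally; `naive_widen` and `proper_utf16le` are illustrative names.

```python
def proper_utf16le(text: str) -> bytes:
    """Correct conversion: encode the Unicode string as UTF-16 LE."""
    return text.encode("utf-16-le")


def naive_widen(raw: bytes) -> bytes:
    """Hypothetical shortcut: treat each input byte as one character
    and append 0x00 to it. Only valid for single-byte ASCII input."""
    return b"".join(bytes([b, 0]) for b in raw)


ascii_pw = "Jameel1"
arabic_utf8 = bytes.fromhex("d8acd985d98ad9842031")  # test password as UTF-8
arabic_pw = arabic_utf8.decode("utf-8")

# For ASCII the shortcut and the real conversion agree.
print(naive_widen(ascii_pw.encode("utf-8")) == proper_utf16le(ascii_pw))

# For the Arabic password they diverge, so the MD4 input (and hash) differs.
print(naive_widen(arabic_utf8).hex())
print(proper_utf16le(arabic_pw).hex())
```

A shortcut like this would crack ASCII passwords without any visible problem, which is why a mismatch of this kind can go unnoticed until a multi-byte password is tested.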
### Supporting Evidence
- The core process for NT hash generation is confirmed as: **Password $\rightarrow$ UTF-16 LE $\rightarrow$ MD4 Hash**.
- The UTF-16 LE byte representation for Jameel's name (regardless of source encoding in this test) was verified as `2c06 4506 4a06 4406 2000 3100`.
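
The pipeline above can be sketched end to end in Python. Because `hashlib.new("md4")` may be unavailable on builds linked against newer OpenSSL versions, this sketch carries its own small MD4 routine written from RFC 1320; it is an illustration under that assumption, not the tooling used in the research. The function names `md4` and `nt_hash` are illustrative.

```python
import struct

MASK = 0xFFFFFFFF


def _rol(value: int, bits: int) -> int:
    """Rotate a 32-bit value left by `bits`."""
    value &= MASK
    return ((value << bits) | (value >> (32 - bits))) & MASK


def md4(data: bytes) -> bytes:
    """Pure-Python MD4 following RFC 1320 (slow, for illustration only)."""
    # Pad with 0x80, zeros up to 56 mod 64, then the bit length (little-endian).
    msg = data + b"\x80"
    msg += b"\x00" * ((56 - len(msg) % 64) % 64)
    msg += struct.pack("<Q", len(data) * 8)

    a, b, c, d = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476
    order2 = [(i % 4) * 4 + i // 4 for i in range(16)]
    order3 = [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]

    for offset in range(0, len(msg), 64):
        x = struct.unpack("<16I", msg[offset:offset + 64])
        aa, bb, cc, dd = a, b, c, d
        for i in range(16):  # round 1
            f = (b & c) | (~b & d)
            a, b, c, d = d, _rol(a + f + x[i], (3, 7, 11, 19)[i % 4]), b, c
        for i in range(16):  # round 2
            g = (b & c) | (b & d) | (c & d)
            t = a + g + x[order2[i]] + 0x5A827999
            a, b, c, d = d, _rol(t, (3, 5, 9, 13)[i % 4]), b, c
        for i in range(16):  # round 3
            h = b ^ c ^ d
            t = a + h + x[order3[i]] + 0x6ED9EBA1
            a, b, c, d = d, _rol(t, (3, 9, 11, 15)[i % 4]), b, c
        a, b = (a + aa) & MASK, (b + bb) & MASK
        c, d = (c + cc) & MASK, (d + dd) & MASK

    return struct.pack("<4I", a, b, c, d)


def nt_hash(password: str) -> str:
    """Password -> UTF-16 LE -> MD4, returned as lowercase hex."""
    return md4(password.encode("utf-16-le")).hex()


# The test password, rebuilt from the UTF-8 bytes quoted in the report.
arabic_pw = bytes.fromhex("d8acd985d98ad9842031").decode("utf-8")
print(nt_hash(arabic_pw))
```

The printed value can be compared against the hash quoted under Technical Details; the well-known RFC 1320 test vectors are a quick way to confirm the MD4 routine itself.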
### Novel Contributions
1. **Bypassing the Standard NT Hash Module:** The research demonstrated a method for cracking hashes where standard NT hash cracking (Mode 1000) fails due to encoding mismatches. It sidesteps Mode 1000's built-in encoding step by:
   a. Manually determining the exact UTF-16 LE byte sequence.
   b. Using the **MD4 algorithm directly (hashcat Mode 900)**.
   c. Supplying the raw UTF-16 LE bytes as hex via the `--hex-charset` option in `hashcat`.
2. **Custom Character Set Brute-Forcing:** A brute-force mask was constructed in `hashcat` using custom hexadecimal character sets (`-1`, `-2`, `-3`) to target search spaces made up of the specific UTF-16 LE byte ranges that correspond to Arabic characters and to Latin digits and the space character.
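
The idea behind such a mask can be sketched in Python by building the hex charset strings and the mask, then checking that the known UTF-16 LE bytes fall inside the resulting search space. The layout below is an assumption made for illustration: it uses four charsets (low and high bytes for Arabic, low and high bytes for ASCII) rather than the three the report mentions, and the exact ranges the author chose are not given in the source. Each UTF-16 LE character occupies two mask positions, low byte first.

```python
def hex_range(start: int, stop: int) -> str:
    """Hex string listing every byte value from start to stop inclusive."""
    return "".join(f"{b:02x}" for b in range(start, stop + 1))


# Assumed layout (hypothetical, not the author's exact charsets).
charset_1 = hex_range(0x21, 0x4A)         # low bytes of U+0621..U+064A
charset_2 = "06"                          # high byte of that Arabic range
charset_3 = "20" + hex_range(0x30, 0x39)  # low bytes of space and digits 0-9
charset_4 = "00"                          # high byte of ASCII characters


def build_mask(arabic_chars: int, ascii_chars: int) -> str:
    """Two mask positions per character: low byte, then high byte."""
    return "?1?2" * arabic_chars + "?3?4" * ascii_chars


def keyspace(arabic_chars: int, ascii_chars: int) -> int:
    """Number of candidates the mask would generate."""
    return (len(charset_1) // 2) ** arabic_chars * (len(charset_3) // 2) ** ascii_chars


def covers(candidate: bytes, arabic_chars: int, ascii_chars: int) -> bool:
    """True if every byte of candidate is allowed at its mask position."""
    sets = [charset_1, charset_2] * arabic_chars + [charset_3, charset_4] * ascii_chars
    if len(candidate) != len(sets):
        return False
    return all(byte in bytes.fromhex(s) for byte, s in zip(candidate, sets))


target = bytes.fromhex("2c0645064a06440620003100")  # UTF-16 LE bytes from the report
print(build_mask(4, 2))
print(keyspace(4, 2))
print(covers(target, 4, 2))
```

Fixing the high bytes keeps the search confined to plausible characters, but the keyspace still grows by a factor of the charset size for every extra character, which is consistent with the report's note that the mask is slow.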
## Technical Details
The critical insight is that the difference in cracking success between the tools comes down to how each handles the UTF-16 LE conversion for non-ASCII characters. When cracking with Mode 900 (raw MD4), the hash and pre-image pair `63123766334a1bf784d4f123e0f4ab71:$HEX[2c0645064a06440620003100]` shows the correct pre-image bytes being supplied directly, which bypasses the encoding step that failed within the standard NT hash module (Mode 1000).
## Practical Implications
### For Security Practitioners
- When dealing with hashes derived from passwords containing non-ASCII characters, standard NT hash cracking profiles may fail without any obvious error. Practitioners should be prepared to fall back to raw MD4 mode (Mode 900) with hex-encoded UTF-16 LE input if standard cracking fails.
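
One way to prepare hex-encoded UTF-16 LE input is to convert an ordinary wordlist into `$HEX[...]` entries, the notation seen in the Technical Details section. The sketch below assumes the cracking tool accepts `$HEX[...]` entries in a wordlist (believed to be the default behavior in `hashcat`, but worth confirming for the version in use); `to_hex_candidate` and `convert_wordlist` are illustrative names.

```python
from typing import Iterable, Iterator


def to_hex_candidate(password: str) -> str:
    """Encode a candidate as UTF-16 LE and wrap it in $HEX[...] notation."""
    return "$HEX[" + password.encode("utf-16-le").hex() + "]"


def convert_wordlist(lines: Iterable[str]) -> Iterator[str]:
    """Convert each non-empty wordlist line, dropping only line endings."""
    for line in lines:
        candidate = line.rstrip("\r\n")
        if candidate:
            yield to_hex_candidate(candidate)


# The test password, rebuilt from the UTF-8 bytes quoted in the report.
arabic_pw = bytes.fromhex("d8acd985d98ad9842031").decode("utf-8")

for entry in convert_wordlist([arabic_pw + "\n", "password\n"]):
    print(entry)
```

Because the conversion happens before the tool sees the candidate, the raw MD4 mode hashes exactly the bytes Windows would have hashed, whatever encoding the original wordlist was saved in.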
### For Defenders
- Organizations storing user credentials should recognize that Windows inherently uses UTF-16 LE for NT hashes. Any manual generation or comparison of hashes must adhere strictly to this encoding specification to ensure accurate validation against captured hashes.
### For Researchers
- The discrepancy in how `hashcat`'s Mode 1000 handles non-ASCII scripts warrants closer inspection of the UTF-16 LE conversion implemented in the tool's source code, and raises questions about its behavior across different operating system locales.
## Limitations
The research focused heavily on a single test vector (Jameel's name in Arabic). Extrapolating this specific failure mode of `hashcat`'s Mode 1000 to all non-ASCII character sets requires broader testing across different languages and scripts. The resulting custom mask for brute-force is acknowledged as computationally intensive and slow.
## Comparison to Prior Work
This work directly corrects the author's prior assumption that the source encoding dictates the final hash. It builds upon general knowledge of NT hash internals (UTF-16 LE + MD4) by adding empirical evidence regarding specific tool implementation failures in the context of multi-byte foreign characters.
## Real-world Applications
- **Improved Hash Dictionary Attacks:** Enabling successful cracking of NT hashes generated from passwords containing complex non-Latin characters.
- **Implementation Considerations:** When developing internal tools or scripts that generate NT hashes (e.g., Mimikatz output simulation), ensure the internal string is converted to pure UTF-16 LE *before* MD4 hashing.
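
A common pitfall when writing such scripts in Python is choosing the generic `utf-16` codec, which prepends a byte-order mark, instead of `utf-16-le`. The short sketch below shows the difference; it illustrates one way a conversion can stray from plain UTF-16 LE and is not drawn from the original research.

```python
import codecs

text = "Jameel1"

with_bom = text.encode("utf-16")    # generic codec: BOM plus native byte order
le_only = text.encode("utf-16-le")  # explicit little-endian, no BOM

# The generic codec output starts with a two-byte BOM that would change the hash.
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(len(with_bom) - len(le_only))
print(le_only.hex())
```

Those two extra bytes would become part of the MD4 input, so a script that uses the generic codec would produce hashes that never match the ones Windows stores.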
## Future Work
- Systematically test other complex scripts (e.g., CJK characters) against `hashcat` Mode 1000 to quantify the scope of the encoding translation issue.
- Investigate why `John the Ripper` correctly cracked the hash where `hashcat` failed under Mode 1000, detailing the internal differences in their respective encoding steps.
## References
- Previous SensePost post on NT hashes and encodings.
- `hashcat` plugin development guide (linked source for understanding mode implementation).
- Character set reference tables (UTF-16 properties).