Full Report
Some applications need to reduce Unicode text to one predictable representation, sometimes all the way down to ASCII. How is this done, though? This document explains how it is done in its different forms.

Canonical equivalence is when two sequences of code points represent the same abstract character. Compatibility equivalence is similar, but weaker: it covers characters that look different yet mean the same thing, for instance stylistic changes like italics, linebreaking differences and others. The second case is weirder to normalize and great care must be put into it.

The specification describes four normalization forms. They are all combinations of the ideas above, where D stands for decomposition, C for composition and K for compatibility:

- NFD: Canonical Decomposition only.
- NFC: Canonical Decomposition, followed by Canonical Composition.
- NFKD: Compatibility Decomposition only.
- NFKC: Compatibility Decomposition, followed by Canonical Composition.

Take the angstrom sign (U+212B), which looks like an A with a ring on top. Canonical decomposition (NFD) turns it into two separate characters: an A (U+0041) and a combining ring above (U+030A). Compatibility decomposition (NFKD) gives the same two characters, because it applies the canonical mappings as well as the compatibility ones. What's interesting, though, is the alternative precomposed version of the same letter at U+00C5. NFD returns the same decomposition as before, but NFC returns U+00C5 for both inputs, so U+212B never comes back. The forms ending in C just do both steps: decompose, then compose.

The K forms matter once formatting is involved. For example, 2⁵ is made up of the digit 2 (U+0032) and a superscript five (U+2075). Both NFC and NFD leave these two characters unchanged. However, NFKD and NFKC turn them into the plain characters 2 and 5. Within NFKC and NFKD, the formatting distinctions are removed from the character.

This process is still somewhat confusing and non-obvious to me. Nonetheless, it's interesting to keep this in mind when looking for bugs.
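The two examples above can be checked directly. The following is a minimal sketch using Python's standard `unicodedata` module; the helper names `codepoints` and `normalize_all` are made up for illustration and are not part of any specification.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")

def codepoints(text):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

def normalize_all(text):
    """Hypothetical helper: apply all four forms and show the code points."""
    return {form: codepoints(unicodedata.normalize(form, text)) for form in FORMS}

angstrom = normalize_all("\u212B")      # ANGSTROM SIGN
two_to_fifth = normalize_all("2\u2075") # digit 2 + SUPERSCRIPT FIVE

for form in FORMS:
    print(form, "|", angstrom[form], "|", two_to_fifth[form])
```

The D forms return `U+0041 U+030A` for the angstrom sign and the C forms return `U+00C5`, while only the K forms replace `U+2075` with `U+0035`.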
Analysis Summary
# Research: UAX #15: Unicode Normalization Forms
## Metadata
- **Authors:** Ken Whistler (Editor)
- **Institution:** The Unicode Consortium
- **Publication:** Unicode Standard Annex (UAX)
- **Date:** 2025-07-30 (Version 17.0.0)
## Abstract
This research and technical specification defines the methodology for normalizing Unicode text. Because Unicode allows for multiple binary representations of the same abstract character—a phenomenon known as "canonical equivalence"—data integrity and security often depend on a uniform representation. This document establishes four distinct Normalization Forms (NFD, NFC, NFKD, and NFKC) to ensure unique binary representations for equivalent strings.
## Research Objective
The primary objective of this documentation is to solve the problem of "representation ambiguity." Specifically, it addresses:
- How can software determine if two different sequences of code points represent the same human-readable text?
- How can systems convert diverse Unicode input into a stable, comparable format without losing semantic meaning?
## Methodology
### Approach
The research utilizes an algorithmic approach to character decomposition and composition. It categorizes equivalence into two types:
1. **Canonical Equivalence:** Characters that are indistinguishable in meaning and appearance (e.g., "A" with an accent vs. "A" + "combining accent").
2. **Compatibility Equivalence:** Characters that represent the same basic information but have different visual forms or formatting (e.g., superscripts, ligatures, or stylized fonts).
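One way to make the two equivalence types concrete is to compare strings after decomposition. This is a sketch under the assumption that comparing NFD (or NFKD) results is an acceptable test of equivalence; the function names are hypothetical.

```python
import unicodedata

def canonically_equivalent(a, b):
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equivalent(a, b):
    """True if a and b have the same compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

# Precomposed A-with-ring vs. A followed by a combining ring above.
precomposed, combining = "\u00C5", "A\u030A"
# The "fi" ligature vs. the two plain letters.
ligature, plain = "\uFB01", "fi"

print(canonically_equivalent(precomposed, combining))  # canonical pair
print(canonically_equivalent(ligature, plain))         # not canonical
print(compatibility_equivalent(ligature, plain))       # compatibility pair
```

Every canonically equivalent pair is also compatibility equivalent, but not the other way around, which is why the ligature only matches under the second function.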
### Dataset/Environment
The scope of the study covers the entirety of the Unicode Character Database (UCD), including mathematical symbols, CJK (Chinese, Japanese, Korean) ideographs, and alphabetic scripts.
### Tools & Technologies
- **Unicode Character Database (UCD):** The underlying repository of character properties.
- **Normalization Algorithms:** Specific recursive processes for decomposition and composition.
- **Conformance Test Suite:** A set of rigorous tests used to verify implementation accuracy.
## Key Findings
### Primary Results
1. **Four Standardization Paths:** The research defines four distinct forms:
   - **NFD (Canonical Decomposition):** Breaks characters into base components.
   - **NFC (Canonical Decomposition followed by Composition):** Re-combines components into precomposed characters where possible (the most common form for the web).
   - **NFKD (Compatibility Decomposition):** Strips formatting (e.g., converting "⁵" to "5").
   - **NFKC (Compatibility Decomposition followed by Composition):** Strips formatting and then re-composes results into canonical forms.
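2. **Stability Guarantees:** Normalization is idempotent: once a string is in a given form, applying that same form again leaves it unchanged. The specification also guarantees that a normalized string of assigned characters stays normalized under later versions of Unicode.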
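Both results can be observed on small inputs. The sketch below assumes Python's `unicodedata` module; `is_stable` and the sample strings are illustrative choices, not taken from the specification.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")

def is_stable(form, text):
    """True if normalizing a second time changes nothing (idempotence)."""
    once = unicodedata.normalize(form, text)
    return unicodedata.normalize(form, once) == once

samples = ["\u212B", "2\u2075", "q\u0307\u0323", "re\u0301sume\u0301"]
all_stable = all(is_stable(form, s) for form in FORMS for s in samples)

# "é" (precomposed) followed by a superscript five: the four forms
# produce strings of different lengths from the same input.
lengths = {form: len(unicodedata.normalize(form, "\u00E9\u2075")) for form in FORMS}

print(all_stable, lengths)
```

The D forms split "é" into two code points, the C forms keep it as one, and only the K forms change the superscript, so the lengths differ even though the text reads the same.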
### Novel Contributions
- **Decomposition Mapping:** An innovative mapping system that distinguishes between essential character identity and "compatibility" formatting.
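- **Composition Exclusion:** Identification of specific characters that the composition step never produces, including singleton decompositions such as U+212B (which composes to U+00C5 instead), so that each set of equivalent strings has a single composed form.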
## Technical Details
Decomposition involves two main stages; NFC and NFKC then run a canonical composition pass over the result:
- **Recursive Decomposition:** Following the `Decomposition_Mapping` property until no further decomposition is possible. For NFKD/NFKC, this includes "Compatibility Mappings."
- **Canonical Ordering:** Sorting each run of combining marks (like accents) by their "Canonical Combining Class" (CCC) value, so that the marks always end up in the same sequence regardless of input order. Marks with the same class keep their relative order.
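The two decomposition stages can be sketched in a few lines. This is a simplified illustration, not a conformant implementation: it reads mappings from Python's `unicodedata.decomposition`, which does not cover the algorithmic decomposition of Hangul syllables, and the names `decompose`, `canonical_order` and `decompose_sketch` are hypothetical.

```python
import unicodedata

def decompose(ch, compat=False):
    """Recursively expand one character via its decomposition mapping."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ch  # no mapping (note: Hangul syllables are not handled here)
    parts = mapping.split()
    if parts[0].startswith("<"):  # tagged mappings are compatibility mappings
        if not compat:
            return ch
        parts = parts[1:]
    return "".join(decompose(chr(int(p, 16)), compat) for p in parts)

def canonical_order(text):
    """Stable-sort each run of combining marks (ccc != 0) by combining class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:  # a starter ends the current run
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

def decompose_sketch(text, compat=False):
    """Approximate NFD (or NFKD when compat=True) for non-Hangul text."""
    return canonical_order("".join(decompose(ch, compat) for ch in text))

# Dot above (ccc 230) typed before dot below (ccc 220) gets reordered.
reordered = decompose_sketch("q\u0307\u0323")
print([f"U+{ord(c):04X}" for c in reordered])
```

Python's `sorted` is stable, which matters here: marks that share a combining class must keep their original relative order.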
## Practical Implications
### For Security Practitioners
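- **Visual Spoofing:** Attackers use different codepoints that look identical to bypass filters. Normalization exposes the cases that are canonically or compatibility equivalent (e.g., fullwidth or precomposed variants of "admin"). It does not fold look-alikes from other scripts, such as "admin" with a Cyrillic 'а'; those need a separate confusable-character check.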
- **Filter Evasion:** If a Web Application Firewall (WAF) looks for `<script>` but the input is in a compatibility form (e.g., stylized Unicode characters), the WAF might miss it unless it normalizes the input before scanning.
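The filter evasion case can be reproduced with fullwidth angle brackets. This is a toy sketch: `naive_filter` and `normalizing_filter` are hypothetical stand-ins for a WAF rule, and the scenario assumes some downstream component applies NFKC after the filter has run.

```python
import unicodedata

def naive_filter(payload):
    """Hypothetical WAF rule: flag payloads containing a literal <script>."""
    return "<script>" in payload.lower()

def normalizing_filter(payload):
    """Same rule, but applied after NFKC normalization."""
    return naive_filter(unicodedata.normalize("NFKC", payload))

# U+FF1C and U+FF1E are the fullwidth forms of "<" and ">".
evasive = "\uFF1Cscript\uFF1E"

print(naive_filter(evasive))        # the raw payload slips past
print(normalizing_filter(evasive))  # the normalized payload is caught
```

The raw payload contains no ASCII angle brackets, so the naive rule misses it, but it becomes a literal `<script>` as soon as anything applies NFKC.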
### For Defenders
- **Normalize at the Boundary:** Always normalize Unicode input to a single form (usually NFC or NFKC) before performing security checks, database lookups, or comparisons.
- **Consistency is King:** Ensure that the normalization form used for "checking" matches the form used for "storing" to prevent bypasses.
### For Researchers
- **Bypass Discovery:** Investigating the "delta" between NFKC and NFC often reveals logic flaws in input validation.
- **Homograph Attacks:** Exploring how normalization affects Internationalized Domain Names (IDN).
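One way to explore the delta between NFKC and NFC is to enumerate the code points that only NFKC folds onto a character of interest. The sketch below is an assumption about how such a search might look; `compat_lookalikes` is a hypothetical name, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata

def compat_lookalikes(target):
    """Code points that NFKC maps to `target` but NFC does not."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogates, which are not characters
            continue
        ch = chr(cp)
        if ch == target:
            continue
        if (unicodedata.normalize("NFKC", ch) == target
                and unicodedata.normalize("NFC", ch) != target):
            hits.append(cp)
    return hits

less_than_variants = compat_lookalikes("<")
print([f"U+{cp:04X}" for cp in less_than_variants])
```

Running the same search for characters such as `/`, `.` or `'` gives a quick list of inputs worth trying against a validator that checks before normalizing.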
## Limitations
- **Loss of Information:** NFKD and NFKC are "lossy" because they strip formatting. For example, converting $2^5$ to 25 via NFKC loses the mathematical meaning.
- **Complexity:** Implementing the full specification is computationally intensive and prone to edge-case errors in custom implementations.
## Comparison to Prior Work
Unlike basic ASCII-to-ASCII comparisons, UAX #15 acknowledges that "equality" in modern computing is multi-faceted. It builds upon early encoding standards by accounting for the vast complexity of global scripts and mathematical notation that 7-bit or 8-bit systems ignored.
## Real-world Applications
- **Database Indexing:** Ensuring that a search for "résumé" finds the record regardless of how the user typed the "é".
- **URL Handling:** Standardizing domain names and paths.
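- **Username Registration:** Preventing two different users from registering names that differ only in representation, such as "User1" and its fullwidth form "Ｕｓｅｒ１", which NFKC folds together. Look-alike characters from another script, as in "User1" and "Usеr1" (with a Cyrillic "е"), survive normalization and need a separate confusable check.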
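The indexing and username cases can be sketched with a lookup key derived from the normalized string. The helper names below are hypothetical, and the choice of NFC for search keys and NFKC for usernames is an assumption made for this example. Note that normalization on its own does not merge look-alike characters from different scripts.

```python
import unicodedata

def search_key(text):
    """Hypothetical index key: NFC, so typed and stored forms compare equal."""
    return unicodedata.normalize("NFC", text)

def username_key(name):
    """Hypothetical uniqueness key: NFKC, so compatibility variants collide."""
    return unicodedata.normalize("NFKC", name)

precomposed = "r\u00E9sum\u00E9"     # "é" as one code point
decomposed = "re\u0301sume\u0301"    # "e" + combining acute accent
same_entry = search_key(precomposed) == search_key(decomposed)

fullwidth = "\uFF35\uFF53\uFF45\uFF52\uFF11"  # fullwidth "User1"
cyrillic = "Us\u0435r1"                       # Cyrillic "е" in place of "e"
fullwidth_collides = username_key(fullwidth) == username_key("User1")
cyrillic_collides = username_key(cyrillic) == username_key("User1")

print(same_entry, fullwidth_collides, cyrillic_collides)
```

The Cyrillic variant still produces a distinct key, so a registration system that wants to block it needs confusable detection on top of normalization.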
## Future Work
- **Performance Optimization:** Developing faster normalization algorithms for high-throughput systems.
- **Expanding the Standard:** As new emojis and scripts (like ancient languages) are added, the decomposition mappings must be updated to maintain consistency.
## References
- Unicode Standard Annex #15 (UAX15)
- Unicode Character Database (UCD)
- [https://www.unicode.org/reports/tr15/](https://www.unicode.org/reports/tr15/)