Unicode Normalization Forms

Full Report

Converting Unicode text to a simpler, predictable form, in some cases all the way down to ASCII, is required for some applications. How is this done, though? This document explains how it is done in each of the different normalization forms.

Canonical equivalence is when two code point sequences represent the same abstract character but use different code points to get there. Compatibility equivalence is similar. The main difference is that it covers things that look different but mean the same thing, for instance stylistic changes like italics, line-breaking differences and others. The second case is weirder to normalize, and great care must be put into it because the formatting information is lost.

The specification describes four normalization forms. They are all combinations of the ideas above, where D stands for decomposition, C for composition and K for compatibility:

NFD: Canonical Decomposition only.
NFC: Canonical Decomposition, followed by Canonical Composition.
NFKD: Compatibility Decomposition only.
NFKC: Compatibility Decomposition, followed by Canonical Composition.

Take the Angstrom sign (U+212B), which looks like an A with a ring on top. Canonical decomposition (NFD) turns this into two separate characters: an A (U+0041) and a combining ring above (U+030A). Compatibility decomposition (NFKD) gives the same two characters, because it applies the canonical mappings as well as the compatibility ones. What's interesting, though, is the alternative version of this character, the precomposed A with ring above (U+00C5). NFD returns the same decomposition as before, and NFC returns U+00C5 for both inputs, so the Angstrom sign never comes back out of NFC as U+212B.

The forms that include composition just do BOTH steps: decompose first, then recompose. For example, 2 to the 5th power can be written as the digit 2 (U+0032) followed by a superscript 5 (U+2075). Both NFC and NFD leave these two characters unchanged. However, NFKD and NFKC turn them into the plain characters 2 and 5 (U+0032 and U+0035) instead of keeping the superscript 5. Within NFKC and NFKD, the formatting distinctions are removed from the character.

This process is still somewhat confusing and non-obvious to me. Nonetheless, it's interesting to keep this in mind when looking for bugs.
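
One way to check the four forms against the two examples is a short Python sketch built on the standard library's unicodedata.normalize. The helper names codepoints and normalize_all are made up for this illustration and are not part of any library. The results depend on the Unicode data shipped with the Python build, although the mappings for these particular characters have been stable for a long time.

```python
import unicodedata

# The four normalization forms defined by the Unicode standard.
FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def codepoints(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def normalize_all(text):
    """Apply every normalization form to text and show the resulting code points."""
    return {form: codepoints(unicodedata.normalize(form, text)) for form in FORMS}


samples = {
    "Angstrom sign": "\u212B",
    "A with ring above": "\u00C5",
    "2 + superscript 5": "2\u2075",
}

for label, sample in samples.items():
    print(f"{label} ({codepoints(sample)})")
    for form, result in normalize_all(sample).items():
        print(f"  {form}: {result}")
```

For both U+212B and U+00C5 this should print U+0041 U+030A for NFD and NFKD, and U+00C5 for NFC and NFKC. For the 2 followed by a superscript 5, NFD and NFC should print U+0032 U+2075 unchanged, while NFKD and NFKC should print U+0032 U+0035.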

Analysis Summary