UTF-8 Explained

Full Report

UTF-8 is the standard variable-length encoding for Unicode, which has room for over 1M code points. There are other encoding forms, like UTF-1, UTF-16 and UTF-32, but UTF-8 is the most widely used. A code point is the number assigned to a character, conventionally written in hexadecimal, such as U+0080. The actual representation in binary is derived from this value.

The first byte of a UTF-8 sequence determines whether the sequence is 1 to 4 bytes long. For ASCII, the code points are 0-0x7F, meaning that no byte with the high bit set is valid ASCII; these code points are encoded as a single byte. For everything else, the number of leading ones (followed by a zero) in the first byte encodes the length. For instance, 110 means 2 bytes and 11110 means 4 bytes. The remaining bits of the first byte hold the top bits of the code point, such as 5 available bits for a 2-byte sequence. The number of bytes that follow depends on that length prefix. Each of them always starts with 10, which marks it as a continuation byte, and its remaining 6 bits carry the rest of the code point. As an example, U+00A3 is 11000010 10100011 in binary. The leading 110 shows that the sequence has 2 bytes, and the following 00010 holds the first 5 bits of the code point. The second byte starts with 10, making it a valid continuation byte, and its last 6 bits (100011) are the rest of the data.

When decoding UTF-8, many byte sequences are not valid. Missing or unexpected continuation bytes, undefined characters and many more problems are to blame. How should these be handled? Should the invalid character be removed, left alone, or something else? What if we convert between character sets? There are many serious issues that can come up if we're not careful. Finally, what does it mean to uppercase a Unicode character? Some languages operate on the code point level while others operate on the character level, which can cause major problems.

From a security perspective, there are many things to consider. First, there are visual tricks that can be done with characters such as the right-to-left override, which changes the direction in which text is displayed.
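The byte layout described earlier can be sketched in Python. This is a minimal illustration rather than a production encoder: the names `utf8_encode` and `as_bits` are made up for this example, and in practice `str.encode("utf-8")` does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if code_point <= 0x7F:
        # ASCII: one byte, high bit clear.
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx: 5 bits in the first byte, 6 in the second.
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])


def as_bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit binary strings."""
    return " ".join(f"{b:08b}" for b in data)


print(as_bits(utf8_encode(0xA3)))  # 11000010 10100011
```

The output for U+00A3 matches the worked example: a 110 length prefix with 5 payload bits, then a 10 continuation prefix with 6 payload bits.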
A second security concern is that, when different encoders and decoders are at play, differences in how they interpret the same bytes can be dangerous as well. The most important thing here is error handling: should we remove the entire code point, remove only the invalid part, or just error out? Different implementations do different things. Go recently listed some weird issues with its JSON parser, for instance.

Similar to case insensitivity, there is also case folding. This is more general than lowercasing and applies across the entire Unicode code point space. A list of case foldings is available online as well. Overall, a good exercise in learning about encoding issues!
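To make the error-handling and case-folding points concrete, here is a small Python sketch. It relies only on the built-in `bytes.decode` error policies (`strict`, `replace`, `ignore`) and `str.casefold`; the helper name `decode_with_policies` and the sample payload are made up for illustration, and other languages or libraries may behave differently.

```python
def decode_with_policies(data: bytes) -> dict:
    """Decode the same bytes under three of Python's built-in error policies."""
    results = {}
    for policy in ("strict", "replace", "ignore"):
        try:
            results[policy] = data.decode("utf-8", errors=policy)
        except UnicodeDecodeError as exc:
            # "strict" refuses the input and reports where decoding failed.
            results[policy] = f"error at byte {exc.start}"
    return results


# 0xC2 announces a 2-byte sequence, but "i" is not a continuation byte.
payload = b"<scr\xc2ipt>"
for policy, text in decode_with_policies(payload).items():
    print(policy, ascii(text))
# strict 'error at byte 4'
# replace '<scr\ufffdipt>'
# ignore '<script>'

# Case folding versus lowercasing, using the German sharp s.
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
print("Straße".upper())     # STRASSE
```

Silently dropping the invalid byte turns the payload into `<script>`, so a filter that inspected the raw bytes and a consumer that decodes with `ignore` would disagree about the content. The last three lines show that case folding and uppercasing can change the length of a string, which is one reason code point level and character level operations give different results.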

Analysis Summary