January 09, 2016 - UTF-8 and Invalid Encoding

I think I tend to end up dealing directly with fonts and font-related issues more than the average programmer. Mostly this is because I tend to work with low-level interfaces where fonts aren't already available, but it's also partially because I occasionally work with other languages. As the somewhat proud author of libpsf, a library for loading PSF fonts, I think I know a bit about fonts. So when I get blamed for an issue with fonts, it doesn't make me very happy. It does, however, provide an opportunity to talk about something different: encodings.

Let's consider the following hypothetical situation. You are presented with a source file containing Korean text; however, once compiled, some of the characters in the document display incorrectly. Some characters show up as black boxes, some show up as Chinese characters, and others show up as Korean characters, but not the intended ones. Still, the majority of characters display correctly. You contact the author of the source document and inform them of the issue. Their response, however, is that the issue is with your fonts, not the document.
Of course, this answer is neither acceptable nor correct, but how would you go about demonstrating that? The first thing you would do is check that your fonts are indeed sufficient to display the characters. Once that's done, you might wonder whether the encoding you are using to display the text is incorrect, and now you are finally almost on the right track. There are really only five or six encodings in common use for Korean text, one of which is specific to North Korea, so we can rule that one out. Three are UTF-8, UTF-16, and UTF-32; the others (legacy encodings such as EUC-KR) are rarely used and very different from UTF, but you try them anyway, and they decode the text even more poorly than UTF-8, with all characters interpreted incorrectly.
You shouldn't be surprised here; after all, since most characters display correctly, that's a pretty good indication that UTF-8 is the correct encoding for this file. But we can do better than that. UTF-8 works a bit differently from other encodings because it is a variable-width encoding, meaning that not all characters take the same number of bytes: some take only 1 byte, some take 2, and some take 3 or 4. Although the Korean alphabet (Hangul) contains only 24 basic letters, those letters get combined into groups of two or three to form a syllable, with each syllable represented on a computer by a single character. For computers to deal with this, the Korean character set needs to contain every valid combination of Korean letters. Not every letter combination forms a valid character, and each character must contain a vowel in the second position, limiting the total number of valid combinations, but there are still 11,172 of them in Unicode's Hangul Syllables block (U+AC00 through U+D7A3). On top of that, these characters are shared by no other language, so they end up stored at codepoints high enough that each valid Korean character needs 3 bytes of storage in UTF-8 (as does everything from U+0800 through U+FFFF). I know of no other encoding that uses 3-byte characters, so the fact that any characters display correctly proves almost without a doubt that UTF-8 is the correct encoding of the document.
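As a quick sanity check, a few lines of Python (assuming Python 3.8 or later) confirm that every precomposed Hangul syllable encodes to exactly 3 bytes of UTF-8:

```python
# Each precomposed Hangul syllable lives at U+AC00..U+D7A3,
# a range UTF-8 encodes in exactly 3 bytes per character.
for ch in "한국어":
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")
# U+D55C -> ed 95 9c (3 bytes)
# U+AD6D -> ea b5 ad (3 bytes)
# U+C5B4 -> ec 96 b4 (3 bytes)
```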
So you open up the source document in a hex editor. There you can confirm that each character is encoded with three bytes, even the invalid ones. The next step is to prove that the incorrectly displayed characters are invalid UTF-8, so now we really need to start talking about how UTF-8 works.

UTF-8 is a variable-width encoding, so when UTF-8 is being decoded, there must be a way for the decoding application to determine whether the character being decoded is 1, 2, 3, or 4 bytes wide. To do this, UTF-8 uses the high bit of each byte. A clear high bit indicates that the character is only 1 byte long. This leaves only 7 bits for character storage; fortunately, the already popular ASCII encoding requires only 7 bits, which makes it very simple for UTF-8 and ASCII to work together. In fact, all ASCII is also valid UTF-8. When the high bit is set, however, that indicates that the byte is part of a multi-byte character. Now, UTF-8 COULD have been designed so that a set high bit indicated the start of a multi-byte sequence, with each subsequent byte also having that bit set until a final byte with its high bit clear. This would have left maximum storage for the encoded character, and would probably feel familiar to C programmers because it somewhat resembles C strings (which end with a null byte). However, this type of encoding would have a problem: there would be no way to differentiate it from random binary data. Or it COULD have been implemented such that the number of high bits set in the first byte indicated the number of bytes to follow, leaving all the remaining space for character storage. However, this implementation would again have the same problem of interpreting random binary data as valid, along with a synchronization problem (recovering from invalid bytes). To solve these problems, UTF-8 utilizes a combination of both of these hypothetical encodings.
UTF-8 sets the high-order bits of the first byte to indicate the width of the character, followed by one clear bit, with the remaining bits of that byte storing the high-order bits of the character (so a 3-byte character's first byte looks like 1110xxxx). The following bytes of the character (continuation bytes) have the highest bit set, followed by one clear bit, followed by the remaining bits for character storage (10xxxxxx). This might sound very complicated and wasteful; after all, for a Korean character it means that only 4 bits of the first byte are available for character storage, and only 6 in each of the remaining bytes. However, this specification allows for some really interesting features.
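To make the layout concrete, here's a sketch in Python of an encoder following the scheme just described. It's simplified for illustration; a real encoder would also reject surrogate codepoints and values above U+10FFFF:

```python
def encode_utf8(codepoint: int) -> bytes:
    """Encode one Unicode codepoint using the lead-byte /
    continuation-byte layout described above (simplified sketch)."""
    if codepoint < 0x80:        # 0xxxxxxx: 7 data bits (plain ASCII)
        return bytes([codepoint])
    if codepoint < 0x800:       # 110xxxxx 10xxxxxx: 11 data bits
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx: 16 data bits
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 11110xxx plus three continuation bytes: 21 data bits
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])
```

For example, `encode_utf8(0xD55C)` produces the same three bytes as `"한".encode("utf-8")`.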
First of all, it means that UTF-8 is self-synchronizing: if an invalid byte is encountered, it does not corrupt all of the following characters. The decoder only needs to look for the next byte with either its high bit clear, or its first 2 bits both set, to know that it is past the invalid character.
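That resynchronization step can be sketched in a few lines of Python (the function name is mine, not from any standard library): skip forward past continuation bytes (10xxxxxx) until reaching a byte that can legitimately start a character:

```python
def next_character_start(data: bytes, pos: int) -> int:
    """After hitting a bad byte at `pos`, advance to the next byte that
    can start a character: high bit clear (ASCII) or top two bits both
    set (a lead byte). Continuation bytes (10xxxxxx) are skipped."""
    pos += 1
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos
```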
Second, and similarly, this means that detecting invalid UTF-8 programmatically is trivial. Simply iterate through the bytes looking for one with its high bit set. When you find one, count how many of the bits below it are also set to determine the character's width, then check that the correct number of continuation bytes follow, each with its highest bit set and its second-highest bit clear. If any of these conditions are not met, you have a provably invalid character, demonstrated such that none can refute your findings.
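The whole check fits in a short Python function. This is a sketch of the bit-pattern test described above, not a full validator: it doesn't reject overlong encodings or surrogate ranges, which a strict decoder also has to do.

```python
def validate_utf8(data: bytes) -> list[int]:
    """Return the byte offsets of invalid UTF-8 sequences (empty
    list means every sequence passed the structural check)."""
    errors = []
    i = 0
    while i < len(data):
        b = data[i]
        if b & 0x80 == 0:                  # 0xxxxxxx: ASCII, 1 byte
            i += 1
            continue
        # The run of set high bits gives the character's width.
        if   b & 0xE0 == 0xC0: width = 2   # 110xxxxx
        elif b & 0xF0 == 0xE0: width = 3   # 1110xxxx
        elif b & 0xF8 == 0xF0: width = 4   # 11110xxx
        else:                              # stray 10xxxxxx or 11111xxx
            errors.append(i)
            i += 1
            continue
        # Every following byte must be a continuation byte: 10xxxxxx.
        tail = data[i + 1 : i + width]
        if len(tail) == width - 1 and all(t & 0xC0 == 0x80 for t in tail):
            i += width
        else:
            errors.append(i)
            i += 1
    return errors
```

Running this over our hypothetical source file would print the exact offsets of the corrupted characters, which is the irrefutable evidence we were after.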

And then you win. That's all for now. Peace!