January 09, 2016
UTF-8 and Invalid Encoding
I think I tend to end up dealing directly with fonts and font-related
issues more than the average programmer. Mostly due to the fact that I
tend to work more with low-level interfaces where fonts aren't already
available, but also partially because I work with other languages
occasionally. As the somewhat proud author of libpsf, a library for
loading psf fonts, I think I know a bit about fonts. So when I get
blamed for an issue with fonts, it doesn't make me real happy. However,
it does provide an opportunity to talk about something different,
encodings.
Let's consider the following hypothetical situation. You are presented
with a source file containing Korean text; however, once compiled,
some of the characters in the document display incorrectly. Some
characters show up as black boxes, some show up as Chinese characters,
and others show up as Korean characters, just not the intended ones.
However, the majority of characters display correctly. You contact the
author of the source document and inform them of the issue; their
response, however, is that the issue is with your fonts, not the
document.
Of course, this answer is not acceptable, and not correct, but how would
you go about demonstrating this? The first thing you would do is check
to ensure that your fonts are indeed sufficient to display the characters.
Once this is done, you might wonder if maybe the encoding you are using
to display the text is not correct. And now you are finally almost on the right
track. There are really only 5 or 6 encodings valid for Korean characters,
one of which is for North Korea, so we can rule that out. Three are UTF-8,
UTF-16, and UTF-32; the others are rarely used and very different from
UTF, but you try them anyway and they decode the text even more poorly
than UTF-8 did, with every character interpreted incorrectly.
You shouldn't be surprised here; after all, since most characters display
correctly, that should be a pretty good indication that UTF-8 is the
correct encoding for this file. But we can do better than that. UTF-8
works a bit differently than other encodings because UTF-8 is a variable
width encoding, meaning that not all characters take the same number of
bytes. Some characters take only 1 byte, some take 2 bytes, and some
take 3 or 4. Although the Korean alphabet (Hangul) only contains 24
letters, those letters get combined into groups of 2 or 3 to form a
syllable, with each syllable being represented on a computer by one
character. For computers to deal with this, the Korean character set
needs to contain every valid combination of all Korean letters. Not all
letter combinations form a valid character, and each character needs to
contain a vowel in the second position, limiting the total number of
valid combinations, but there are still thousands of them (Unicode's
precomposed Hangul Syllables block alone contains 11,172). On top of
that, these characters are shared with no other language, so they end up
at fairly high Unicode code points (U+AC00 through U+D7A3). This means
that every valid Korean character is going to end up needing 3 bytes of
UTF-8 storage. I know of no other encoding that uses 3-byte characters, so
the fact that any characters display correctly proves almost without a
doubt that UTF-8 is the correct encoding of the document.
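If you want to check that for yourself before reaching for a hex editor,
a few lines of C are enough. This is only a sketch: it assumes the source
file containing it is itself saved as UTF-8 (so the compiler passes the
bytes of the string literal through untouched), and the syllable 한
(U+D55C) is just an arbitrary example.

    /* Dump the bytes of one Korean syllable as stored in this file. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned char *syllable = (const unsigned char *)"한"; /* U+D55C */
        for (int i = 0; syllable[i] != '\0'; i++)
            printf("%02X ", syllable[i]);
        printf("\n"); /* prints "ED 95 9C" -- three bytes, as claimed above */
        return 0;
    }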
So you open up this source document in a hex editor. Here you can confirm
that each character is encoded with three bytes, even the ones that display incorrectly.
The next step is to prove that the incorrectly displayed characters are
invalid UTF-8. So now we really need to start talking about how UTF-8
works.
UTF-8 is a variable width encoding. So when UTF-8 is being decoded, there
must be a way for the decoding application to determine whether the
character being decoded is 1, 2, 3, or 4 bytes wide. To do this, UTF-8
uses the high bit of each byte. A clear high bit indicates that this
character is only 1 byte long. This leaves only 7 bits for character
storage however. Fortunately, the already popular ASCII encoding only
requires 7 bits, which makes it very simple for UTF-8 and ASCII to work
together. In fact, all ASCII is also valid UTF-8. When the high bit is set
however, that indicates that this byte is part of a multi-byte character.
Now UTF-8 COULD have been designed so that a high bit set would indicate
the start of multi-byte encoding, and then each subsequent byte would also
have that bit set until the last byte ended with its high bit clear. This
would have left maximum storage for the encoded character, and would
probably feel familiar to C programmers because it would somewhat resemble
C strings (ending with a null-byte). However, this type of encoding would
have a few problems. First of all, there would be no way to differentiate
between this hypothetical encoding and random binary data. Or UTF-8 COULD
have been implemented such that the number of high bits set indicated the
number of bytes to follow, leaving all the remaining space for character
storage. However, this implementation would again have the same problem
of interpreting random binary data as valid, plus a problem with
synchronization (recovering from invalid bytes). To solve these problems,
UTF-8 utilizes a combination of both of these hypothetical encodings.
UTF-8 sets the high-order bits of the first byte to indicate the width
of the character (one set bit per byte in the sequence), followed by one
clear bit, and uses the remaining bits of that byte to store the
high-order bits of the character. Then the
following bytes in the character (continuation bytes) have the highest bit
set, followed by one clear bit, followed by the remaining bits for character
storage. This might sound very complicated and wasteful; after all, for a
Korean character, it would mean that in the first byte, only 4 bits are
available for character storage, and only 6 each in the remaining bytes.
However, this specification allows for some really interesting features.
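To make that layout concrete, here is a sketch of just the three-byte
case, the one every Korean syllable falls into. It is not a full encoder;
it ignores the 1-, 2-, and 4-byte forms and does no range checking, it
only shows how the bits get packed.

    /* Pack a code point in the three-byte range (U+0800..U+FFFF) into the
     * 1110xxxx 10xxxxxx 10xxxxxx layout described above. */
    #include <stdio.h>

    static void utf8_encode3(unsigned int cp, unsigned char out[3])
    {
        out[0] = 0xE0 | ((cp >> 12) & 0x0F); /* lead byte: 1110 + top 4 bits */
        out[1] = 0x80 | ((cp >> 6)  & 0x3F); /* continuation: 10 + middle 6  */
        out[2] = 0x80 |  (cp        & 0x3F); /* continuation: 10 + low 6     */
    }

    int main(void)
    {
        unsigned char buf[3];
        utf8_encode3(0xD55C, buf);                          /* U+D55C, 한 */
        printf("%02X %02X %02X\n", buf[0], buf[1], buf[2]); /* ED 95 9C   */
        return 0;
    }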
First of all, this design means that UTF-8 is self-synchronizing: an
invalid character does not corrupt all of the characters that follow it.
When the decoder encounters an invalid character, it only needs to look
for the next byte with either its high bit clear or its first 2 bits set
to know that it is past the invalid character.
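In code, that recovery step is about as small as it sounds. A rough
sketch (the function, buffer, and index names here are just for
illustration):

    /* Skip past an invalid sequence: advance until the current byte is no
     * longer a continuation byte (10xxxxxx), i.e. until it is plain ASCII
     * or the lead byte of the next character. */
    #include <stddef.h>

    size_t utf8_resync(const unsigned char *buf, size_t len, size_t i)
    {
        while (i < len && (buf[i] & 0xC0) == 0x80)
            i++;
        return i;
    }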
Second, and similarly, this means that detecting invalid UTF-8
programmatically is trivial. Simply iterate through the bytes looking for
one with its high bit set. When you find one, count how many of the bits
directly below it are also set, then check that many bytes ahead to ensure
that every one of them has its highest bit set and its second-highest bit
clear. If any of those conditions are not met, you have a 100% invalid
character, and you have proven it such that no one can refute your findings.
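Here is a rough sketch of that check in C. It only enforces the
structural rules described above (lead-byte widths and continuation
bytes); it does not reject overlong sequences or surrogate code points,
but it is more than enough to catch the kind of damage in this document.

    /* Return 1 if buf[0..len) is structurally valid UTF-8, 0 otherwise. */
    #include <stddef.h>

    int utf8_structurally_valid(const unsigned char *buf, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            unsigned char b = buf[i];
            int width;
            if ((b & 0x80) == 0x00)      width = 1; /* 0xxxxxxx: ASCII      */
            else if ((b & 0xE0) == 0xC0) width = 2; /* 110xxxxx             */
            else if ((b & 0xF0) == 0xE0) width = 3; /* 1110xxxx             */
            else if ((b & 0xF8) == 0xF0) width = 4; /* 11110xxx             */
            else return 0;              /* stray continuation or bad lead   */
            if (i + width > len)
                return 0;               /* sequence runs off the end        */
            for (int k = 1; k < width; k++)
                if ((buf[i + k] & 0xC0) != 0x80)
                    return 0;           /* expected 10xxxxxx here           */
            i += width;
        }
        return 1;
    }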
And then you win. That's all for now. Peace!