UTF-8

UTF-8 Definition

UTF-8, or Unicode Transformation Format-8, is a variable-length encoding of Unicode characters built from octets (8-bit bytes). UTF-8 can represent any Unicode character, yet its first 128 byte codes and their character assignments are identical to ASCII. Because of this compatibility, UTF-8 is increasingly the preferred encoding for XML.

The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data and to include UTF-8 among the supported character encodings. In addition, the Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8.

UTF-8 Encoding

UTF-8 encodes each character in 1 to 4 octets, or 8-bit bytes. It takes 1 byte to encode the 128 US-ASCII characters. It takes 2 bytes to encode Latin letters with diacritics and characters from the Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, and Thaana alphabets. It takes 3 bytes to encode the rest of the characters in the Basic Multilingual Plane, which includes commonly used characters. It takes 4 bytes to encode the characters in the other planes of Unicode, which are rarely used.
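The four length classes can be observed directly with Python's built-in encoder. The following is a minimal sketch; the sample characters and the helper name utf8_length are chosen here for illustration and do not come from any standard.

```python
# One illustrative character from each length class (assumed samples).
samples = {
    "A": "US-ASCII letter",
    "\u00e9": "Latin letter with a diacritic (e with acute accent)",
    "\u20ac": "other Basic Multilingual Plane character (euro sign)",
    "\U0001F600": "character outside the Basic Multilingual Plane (emoji)",
}


def utf8_length(ch: str) -> int:
    """Return the number of octets UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


for ch, description in samples.items():
    print(f"U+{ord(ch):04X} ({description}): {utf8_length(ch)} byte(s)")
```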

Rationale Behind the Design

UTF-8 was designed to satisfy the following properties:

The most significant bit of a single-byte character is always 0.

The first byte of a multi-byte sequence determines the length of the sequence. For instance, the most significant bits are 110 for 2-byte sequences, 1110 for 3-byte sequences, and so forth.

The two most significant bits of each remaining byte in a multi-byte sequence are 10.

A UTF-8 stream contains neither the byte FE nor the byte FF, which ensures that a UTF-8 stream does not resemble a UTF-16 stream.
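These properties mean that a single byte can be classified by its leading bits alone. The sketch below illustrates that idea; the function names sequence_length and is_continuation are hypothetical, and the code checks only the bit patterns described above, not every rule a full validator would enforce.

```python
def sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by a lead byte.

    Returns 0 if the byte cannot start a sequence under the bit patterns
    above (a continuation byte, or a byte such as FE or FF).
    """
    if first_byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx: single byte
        return 1
    if first_byte & 0b1110_0000 == 0b1100_0000:  # 110xxxxx: 2-byte sequence
        return 2
    if first_byte & 0b1111_0000 == 0b1110_0000:  # 1110xxxx: 3-byte sequence
        return 3
    if first_byte & 0b1111_1000 == 0b1111_0000:  # 11110xxx: 4-byte sequence
        return 4
    return 0


def is_continuation(byte: int) -> bool:
    """Return True if the byte has the form 10xxxxxx."""
    return byte & 0b1100_0000 == 0b1000_0000


encoded = "\u20ac".encode("utf-8")  # euro sign, three bytes
print(sequence_length(encoded[0]), [is_continuation(b) for b in encoded[1:]])
```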

The satisfaction of these properties ensures that no byte sequence of one character will be contained in a longer byte sequence of another character. This allows for the application of byte-wise sub-string matching when searching for a word or a phrase within a text. The satisfaction of these properties also allows for resynchronization at the beginning of the next character in the event of loss or corruption of one or more complete bytes.
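Both consequences can be sketched in a few lines of Python. The sample text and the helper resynchronize are assumptions made for illustration: the first part searches the encoded bytes directly, and the second skips stray continuation bytes after a simulated loss of one byte.

```python
text = "na\u00efve caf\u00e9 \u20ac5".encode("utf-8")
needle = "caf\u00e9".encode("utf-8")

# Byte-wise substring search: a match can only begin at a character
# boundary, so the prefix before the match decodes cleanly.
byte_index = text.find(needle)
prefix = text[:byte_index].decode("utf-8")


def resynchronize(data: bytes, pos: int) -> int:
    """Return the first index at or after pos that is not a continuation byte."""
    while pos < len(data) and data[pos] & 0b1100_0000 == 0b1000_0000:
        pos += 1
    return pos


# Simulate losing the lead byte of the euro sign, then skip what is left
# of the damaged character and decode the rest of the stream.
euro_start = text.find("\u20ac".encode("utf-8"))
damaged = text[:euro_start] + text[euro_start + 1:]
restart = resynchronize(damaged, euro_start)
recovered = damaged[restart:].decode("utf-8")
```

Only the damaged character is lost; decoding resumes at the next character in the stream.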

Moreover, the design of the byte sequences decreases the probability of misinterpretation. The probability that a random sequence of bytes is valid UTF-8 and not ASCII is 3.1% for a 2-byte sequence, 0.39% for a 3-byte sequence, and lower for longer sequences.
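The 2-byte figure can be checked by counting bit patterns. The sketch below assumes that a sequence counts as valid when it forms exactly one multi-byte character by the lead-byte and continuation-byte patterns alone; any stricter validity rule would lower the counts slightly.

```python
# Number of byte values matching each bit pattern.
LEAD_2 = 2 ** 5  # 110xxxxx: 32 possible lead bytes of a 2-byte sequence
LEAD_3 = 2 ** 4  # 1110xxxx: 16 possible lead bytes of a 3-byte sequence
CONT = 2 ** 6    # 10xxxxxx: 64 possible continuation bytes

# Fraction of all random 2-byte and 3-byte strings that match the patterns.
p2 = LEAD_2 * CONT / 256 ** 2       # 0.03125, about 3.1%
p3 = LEAD_3 * CONT ** 2 / 256 ** 3  # 0.00390625, about 0.39%

print(p2, p3)
```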

Advantages and Disadvantages of UTF-8

General Advantages

  • Existing ASCII texts need no conversion to UTF-8, because a plain ASCII string is also a valid UTF-8 string.
  • When sorted by standard byte-oriented sorting routines, UTF-8 strings come out in the same order as when sorted by Unicode code point.
  • UTF-8 and UTF-16 are the default encodings for XML documents; every other encoding must be specified, either through a text declaration or externally.
  • Any byte-oriented string search algorithm can be used with UTF-8 data, provided that the inputs consist of complete UTF-8 characters.
  • The design of the byte sequences makes it unlikely that a string of characters in any other encoding will be mistaken for valid UTF-8. This probability decreases as the string length increases.
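The first two advantages are easy to check in Python, which compares str values by code point and bytes values byte by byte. This is a small sketch with sample strings chosen for illustration.

```python
# A plain ASCII string has the same bytes in ASCII and in UTF-8.
ascii_text = "Hello, world"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Sorting the encoded bytes gives the same order as sorting by code point.
words = ["z", "\u00e9", "a", "\u20ac", "\U0001F600", "Z"]
by_code_point = sorted(words)
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

print(same_bytes, by_bytes == by_code_point)
```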

Disadvantages

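  • Except for ASCII characters, UTF-8 encoded text is larger than the same text in an appropriate single-byte encoding.
  • Cutting strings is easier in encodings that use a single byte per character, because a UTF-8 string may only be cut at a character boundary.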
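The extra care needed when cutting a string can be sketched as follows. The helper truncate_utf8 is hypothetical; it assumes the input is valid UTF-8 and that max_bytes is not negative, and it backs up from the requested cut point until it reaches a byte that is not a continuation byte.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 data to at most max_bytes without splitting a character."""
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # If the byte just after the cut is a continuation byte (10xxxxxx),
    # the cut falls inside a character, so move it back.
    while cut > 0 and data[cut] & 0b1100_0000 == 0b1000_0000:
        cut -= 1
    return data[:cut]


data = "caf\u00e9s".encode("utf-8")  # 6 bytes; the accented e takes 2
print(truncate_utf8(data, 4))        # the accented e does not fit, so it is dropped
```

A naive slice such as data[:4] would keep only the first byte of the two-byte character and leave a sequence that no longer decodes as UTF-8.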