Unicode

Unicode is the international standard whose goal is to assign a single unique integer, called a code point, to every character needed by every written human language. It is the explicit aim of Unicode to supersede traditional character encodings such as those defined by the ISO 8859 standard, which are used in various countries of the world but are largely incompatible with each other.

Unicode is intended to encode the underlying characters, not the variant glyphs used to draw them. It aims to provide a code point for each character, but not for each glyph; or, to put this in more common (but less accurate) terms, Unicode aims to provide a unique number for each letter, without regard to typographic variations used by printers.

This simple aim is greatly complicated by another aim, which is to provide lossless conversion amongst different existing encodings.

Table of contents

1 Unicode Consortium
2 Repertoire
3 Encodings

3.1 UTF-32
3.2 UTF-16
3.3 UTF-8
3.4 UTF-7

4 Miscellaneous
5 Unicode on the web
6 Unicode fonts
7 Unicode revision history
8 External links

Unicode Consortium

The California-based Unicode Consortium first published "The Unicode Standard" in 1991, and continues to develop standards based on that original work. Unicode was developed in conjunction with the International Organization for Standardization and it shares its character repertoire with ISO 10646. Unicode and ISO 10646 are equivalent as character encodings, but The Unicode Standard contains much more information for implementers, covering, in depth, topics such as bitwise encoding, collation, and rendering, and enumerating a multitude of character properties, including those needed for BiDi support. The two standards also have slightly different terminology.

Repertoire

Unicode reserves 1114112 (= 2²⁰+2¹⁶) code points, and currently assigns characters to more than 96000 of those code points. The first 256 codes precisely match those of ISO 8859-1, the most popular 8-bit character encoding in the "Western world"; as a result, the first 128 characters are also identical to ASCII.

The Unicode code space for characters is divided into 17 "planes" and each plane has 65536 code points. The first plane (plane 0), the Basic Multilingual Plane (BMP), is where most characters have been assigned, so far. The BMP contains characters for almost all modern languages, and a large number of special characters. Most of the allocated code points in the BMP are used to encode CJK characters.
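The plane layout described above can be checked with a few lines of arithmetic. The following is a minimal sketch; the names PLANE_SIZE, PLANE_COUNT and plane_of are illustrative choices, not terms from the standard.

```python
PLANE_SIZE = 0x10000    # 65536 code points per plane
PLANE_COUNT = 17        # planes 0 through 16
MAX_CODE_POINT = PLANE_SIZE * PLANE_COUNT - 1   # 0x10FFFF


def plane_of(code_point: int) -> int:
    """Return the number of the plane that contains the given code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane holds 2**16 code points, so the plane is the value above bit 16.
    return code_point >> 16


print(PLANE_SIZE * PLANE_COUNT)   # 1114112
print(plane_of(ord("A")))         # 0 (Basic Multilingual Plane)
print(plane_of(0x1D11E))          # 1 (a musical symbol in the SMP)
print(plane_of(0x20000))          # 2 (first code point of the SIP)
```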

Two more planes are used for "graphic" characters. Plane 1, the Supplementary Multilingual Plane (SMP), is mostly used for historic scripts such as Linear B, but is also used for musical and mathematical symbols. Plane 2, the Supplementary Ideographic Plane (SIP), is used for about 40000 rare Chinese characters that are mostly historic, although there are some modern ones. Plane 14 currently contains some non-recommended language tag characters and some variation selection characters. Plane 15 and Plane 16 are open for any private use.

There is much controversy among CJK specialists, particularly Japanese ones, about the desirability and technical merit of the "Han unification" process used to map multiple Chinese and Japanese character sets into a single set of unified characters. (See Chinese character encoding.)

The cap of ~2²⁰ code points exists in order to maintain compatibility with the UTF-16 encoding, which can only address that range (see below). With more than 96000 of 1114112 code points assigned, less than 10% of the code space is in use, which suggests that this limit is unlikely to be reached in the near future.
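The size of the cap follows from how UTF-16 addresses code points, which the following sketch works through. It assumes the usual split of the reserved D800-DFFF range into 1024 leading and 1024 trailing values (a detail not spelled out in this article), and the figure of 96000 assigned characters is the approximate count quoted above.

```python
BMP_SIZE = 2**16                           # code points reachable with one 16-bit unit
LEADING_VALUES = 0xDBFF - 0xD800 + 1       # 1024 possible first units of a pair
TRAILING_VALUES = 0xDFFF - 0xDC00 + 1      # 1024 possible second units of a pair

# Every (leading, trailing) combination addresses one code point beyond the BMP.
supplementary = LEADING_VALUES * TRAILING_VALUES
total = BMP_SIZE + supplementary

print(supplementary == 2**20)    # True
print(total)                     # 1114112
print(f"{96000 / total:.1%}")    # 8.6%
```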

Encodings

So far, Unicode has been described only as a means of assigning a unique number to every character used by humans in written language. How these numbers are stored in text processing is another matter. Problems arise because much software in the West has so far been written to deal with 8-bit character encodings only, and Unicode support has been added only slowly in recent years.

The internal logic of much 8-bit legacy software typically permits only 8 bits for each character, making it impossible to use more than 256 code points without special processing. Several mechanisms have therefore been suggested to implement Unicode; which one is chosen depends on available storage space, source code compatibility, and interoperability with other systems.

UTF-32

The simplest possible way to store all possible 2²⁰+2¹⁶ Unicode code points is to use 32 bits for each character, that is, four bytes; hence, this encoding is referred to as UTF-32 by Unicode and UCS-4 in ISO/IEC 10646 documentation. The main problem with this method is that it uses four times the space of traditional 8-bit encodings, which is why it is rarely used for external storage. However, due to its simplicity, many programs use 32-bit encodings internally when processing Unicode.
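The fixed width of UTF-32 is easy to observe with Python's built-in "utf-32-be" codec (the big-endian form without a byte order mark). This is only a sketch; the sample string is arbitrary.

```python
text = "a\u0394\u8449"   # "aΔ葉": one ASCII letter, one Greek letter, one CJK character

encoded = text.encode("utf-32-be")
print(len(text), len(encoded))   # 3 12  (always four bytes per character)

# Because every character occupies exactly four bytes, the n-th code point
# can be read back directly from byte offset 4 * n.
code_points = [
    int.from_bytes(encoded[i:i + 4], "big")
    for i in range(0, len(encoded), 4)
]
print([hex(cp) for cp in code_points])   # ['0x61', '0x394', '0x8449']
```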

UTF-16

UTF-16 is a variable-length encoding that uses either one or two 16-bit words for each character; on most platforms these appear as two or four 8-bit bytes.

The byte order depends on the platform hardware, so UTF-16 data streams normally begin with the zero-width no-break space character (U+FEFF). In this position it is not considered part of the text data but serves as a Byte Order Mark (BOM): a consistent preamble (bytes FE FF or FF FE) that tells the decoder the stream's byte order. Given the unlikelihood of these bytes appearing at the head of a non-UTF-16 stream, UTF-16 streams with a BOM are effectively self-identifying. There are also two variants of UTF-16 that preclude the use of a BOM: UTF-16LE and UTF-16BE.
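A decoder can determine the byte order by inspecting the first two bytes of the stream. The sketch below uses the BOM constants from Python's standard codecs module; the function name detect_utf16_byte_order is a hypothetical helper introduced here for illustration.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Classify a UTF-16 stream by its leading Byte Order Mark, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        return "little-endian"
    return "unknown"                            # no BOM present


print(detect_utf16_byte_order(b"\xfe\xff\x00A"))   # big-endian
print(detect_utf16_byte_order(b"\xff\xfeA\x00"))   # little-endian

# Python's generic "utf-16" codec consumes the BOM and decodes accordingly.
print(b"\xfe\xff\x00A".decode("utf-16"))           # A
print(b"\xff\xfeA\x00".decode("utf-16"))           # A
```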

UTF-16 allows characters in the BMP to be encoded directly as single 16-bit code values. Characters beyond the BMP are encoded as a pair of 16-bit code values drawn from a range of reserved code points, in the D800-DFFF range, that have not been individually assigned to characters.
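The pairing scheme can be sketched as follows. The split into a leading value from D800-DBFF and a trailing value from DC00-DFFF, each carrying 10 bits, is the standard arrangement but is not spelled out above; the function name to_utf16_code_units is an illustrative choice.

```python
def to_utf16_code_units(code_point: int) -> list[int]:
    """Return the one or two 16-bit code values that encode a code point."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("reserved code points are not characters")
    if code_point < 0x10000:
        return [code_point]                  # BMP: encoded directly
    if code_point > 0x10FFFF:
        raise ValueError("beyond the Unicode code space")
    offset = code_point - 0x10000            # 20-bit value, 0 .. 0xFFFFF
    leading = 0xD800 + (offset >> 10)        # upper 10 bits
    trailing = 0xDC00 + (offset & 0x3FF)     # lower 10 bits
    return [leading, trailing]


print([hex(u) for u in to_utf16_code_units(0x8449)])    # ['0x8449']
print([hex(u) for u in to_utf16_code_units(0x1D11E)])   # ['0xd834', '0xdd1e']
```

The second result agrees with Python's built-in codec, which encodes the same character as the bytes D8 34 DD 1E in UTF-16BE.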

UTF-8

Another common encoding is UTF-8, which is also a variable-length encoding. The first 128 code points are represented by one byte, and are equivalent to ASCII. Representation of higher code points requires two to four bytes.

UTF-8 has several advantages, especially when adding support for Unicode to existing software. First, software that handles only ASCII text needs no changes, because ASCII text is already valid UTF-8. Second, most functions from the standard library of the C programming language that have traditionally been used for character processing (such as strcmp for comparisons and trivial sorting) still work, because they operate on 8-bit values. (By contrast, to support the 16- or 32-bit encodings mentioned above, large parts of older software would have to be rewritten.) Third, for most texts that use relatively few non-ASCII characters (that is, texts in most Western languages), the encoding is very space-efficient because it requires only slightly more than 8 bits per character.
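The ASCII compatibility and the space argument can be illustrated by comparing encoded sizes. This is a rough sketch with two arbitrary sample words; encoded_sizes is a hypothetical helper, and the BOM-free "-be" codec variants are used so that no byte order mark is counted.

```python
samples = {
    "English": "Unicode",
    "German": "Gr\u00f6\u00dfe",   # "Größe", with precomposed ö and ß
}


def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes the text occupies in each encoding."""
    return {
        enc: len(text.encode(enc))
        for enc in ("utf-8", "utf-16-be", "utf-32-be")
    }


for label, text in samples.items():
    print(label, len(text), encoded_sizes(text))

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
print("Unicode".encode("utf-8") == "Unicode".encode("ascii"))   # True
```

For the German word, UTF-8 needs 7 bytes for 5 characters, against 10 bytes in UTF-16 and 20 in UTF-32.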

The exact mechanics of UTF-8 are as follows (numbers prefixed with 0x are in hexadecimal notation):

For a scalar value less than 0x80, use one byte with the same scalar value.
For a scalar value less than 0x800, use two bytes, where the first is 0xC0 plus the number represented by the 7th-11th least significant bits, while the second is 0x80 plus the 1st-6th least significant bits.
For a scalar value less than 0x10000, use three bytes. The first is 0xE0 plus the 13th-16th LSBs, the second 0x80 plus the 7th-12th LSBs, and the third 0x80 plus the 1st-6th LSBs.
For a scalar value less than 0x200000, use four bytes, namely 0xF0 plus the 19th-21st LSBs; 0x80 plus the 13th-18th; 0x80 plus the 7th-12th; and 0x80 plus the 1st-6th LSBs.

No other sequences are legal in Unicode, because the code space ends at 0x10FFFF, which is well below 0x200000. The original UTF-8 design also defined five- and six-byte sequences for the larger 31-bit code space of ISO 10646, but these do not correspond to any Unicode code point.
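The four rules listed above translate almost directly into code. The following is a minimal sketch rather than a production encoder; the name utf8_encode is illustrative, and the checks for the reserved range D800-DFFF and for the end of the code space at 0x10FFFF are additions beyond the four rules.

```python
def utf8_encode(scalar: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8, following the four rules."""
    if scalar < 0 or scalar > 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("reserved for UTF-16, not a scalar value")
    if scalar < 0x80:
        return bytes([scalar])
    if scalar < 0x800:
        return bytes([
            0xC0 | (scalar >> 6),            # bits 7-11
            0x80 | (scalar & 0x3F),          # bits 1-6
        ])
    if scalar < 0x10000:
        return bytes([
            0xE0 | (scalar >> 12),           # bits 13-16
            0x80 | ((scalar >> 6) & 0x3F),   # bits 7-12
            0x80 | (scalar & 0x3F),          # bits 1-6
        ])
    return bytes([
        0xF0 | (scalar >> 18),               # bits 19-21
        0x80 | ((scalar >> 12) & 0x3F),      # bits 13-18
        0x80 | ((scalar >> 6) & 0x3F),       # bits 7-12
        0x80 | (scalar & 0x3F),              # bits 1-6
    ])


print(utf8_encode(0x41).hex())      # 41
print(utf8_encode(0x394).hex())     # ce94
print(utf8_encode(0x8449).hex())    # e89189
print(utf8_encode(0x1D11E).hex())   # f09d849e
```

The results can be compared against Python's built-in codec, for example chr(0x8449).encode("utf-8").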

As a consequence of the above details, the following properties of multi-byte sequences hold:

The most significant bit of a single-byte character is always 0.
The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences; 1110 for three-byte sequences, etc.
The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
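These bit patterns let a decoder split a byte stream into characters without decoding them. The sketch below assumes well-formed input and sequences of at most four bytes; the names utf8_sequence_length, is_continuation and split_sequences are hypothetical helpers.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by the first byte."""
    if (first_byte & 0b1000_0000) == 0b0000_0000:
        return 1                                  # 0xxxxxxx
    if (first_byte & 0b1110_0000) == 0b1100_0000:
        return 2                                  # 110xxxxx
    if (first_byte & 0b1111_0000) == 0b1110_0000:
        return 3                                  # 1110xxxx
    if (first_byte & 0b1111_1000) == 0b1111_0000:
        return 4                                  # 11110xxx
    raise ValueError("not the first byte of a sequence")


def is_continuation(byte: int) -> bool:
    """True for the non-initial bytes of a multi-byte sequence (10xxxxxx)."""
    return (byte & 0b1100_0000) == 0b1000_0000


def split_sequences(data: bytes) -> list[bytes]:
    """Split well-formed UTF-8 data into one byte string per character."""
    sequences = []
    position = 0
    while position < len(data):
        length = utf8_sequence_length(data[position])
        sequences.append(data[position:position + length])
        position += length
    return sequences


print(split_sequences("a\u0394\u8449".encode("utf-8")))
# [b'a', b'\xce\x94', b'\xe8\x91\x89']
```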

UTF-8 was designed to satisfy these properties in order to guarantee that no byte sequence of one character is contained within a longer byte sequence of another character. This ensures that byte-wise sub-string matching can be applied to search for words or phrases within a text; some older variable-length 8-bit encodings (such as Shift-JIS) did not have this property and thus made string-matching algorithms rather complicated. Although it is argued that this property adds redundancy to UTF-8-encoded text, the advantages outweigh this concern; besides, data compression is not one of Unicode's aims and must be considered independently.
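The difference from Shift-JIS can be demonstrated with a well-known problem case, sketched here with Python's built-in "shift_jis" codec. The choice of character is an illustrative assumption: the second byte of the kanji 表 in Shift-JIS is 0x5C, the same value as the ASCII backslash.

```python
kanji = "\u8868"          # the character 表
backslash = b"\x5c"       # the ASCII backslash as a single byte

shift_jis_bytes = kanji.encode("shift_jis")
utf8_bytes = kanji.encode("utf-8")

print(shift_jis_bytes.hex())           # 955c
print(backslash in shift_jis_bytes)    # True: a false match inside one character

print(utf8_bytes.hex())                # e8a1a8
print(backslash in utf8_bytes)         # False: every byte has its top bit set
```

In UTF-8 a byte-wise search for an ASCII character can never match inside a multi-byte sequence, because all bytes of such a sequence are 0x80 or above.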

UTF-7

The least common encoding is probably UTF-7. Mail standards have traditionally required that messages be sent as 7-bit ASCII, so email that carries a Unicode encoding in raw 8-bit form is technically invalid; in practice, however, this restriction is widely ignored. UTF-7 allows mail to use Unicode while still following the standards. Most ASCII characters are encoded as is; any character outside the 128 ASCII characters is encoded using an escape sequence consisting of a '+' character, followed by the character's UTF-16 representation encoded in Base64, and terminated by a '-'. Literal '+' characters are encoded as '+-'.
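The escape mechanism can be observed with Python's built-in "utf-7" codec. This is only a sketch; the sample string containing U+263A (a smiling face) is an arbitrary choice, and the exact bytes an encoder emits for a given text may vary between implementations, so only the decoding direction and the '+' rule are shown with expected results.

```python
# Decoding: "+Jjo-" is the Base64 escape for U+263A, and the '-' that
# follows the escape is an ordinary hyphen.
decoded = b"Hi Mom -+Jjo--!".decode("utf-7")
print(decoded == "Hi Mom -\u263a-!")    # True

# A literal plus sign is written as "+-".
print("+".encode("utf-7"))              # b'+-'
print(b"1 +- 1".decode("utf-7"))        # 1 + 1

# Encoding followed by decoding returns the original text, and the
# encoded form contains only 7-bit bytes.
encoded = "Hi Mom -\u263a-!".encode("utf-7")
print(all(byte < 0x80 for byte in encoded))   # True
print(encoded.decode("utf-7") == decoded)     # True
```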

Miscellaneous

The Unicode standard also includes a number of related items, such as character properties, text normalisation forms, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).

In 1997 a proposal was made to encode the characters of the Klingon language in Plane 1 of ISO/IEC 10646-2. The proposal was rejected in 2001 as "inappropriate for encoding." The Elvish script Tengwar from J. R. R. Tolkien's Middle-earth setting was proposed for inclusion in 1993.

Unicode on the web

Recent web browsers display web pages using Unicode if an appropriate font is installed (see Unicode and HTML).

Although syntax rules may affect the order in which characters are allowed to appear, both HTML 4.0 and XML 1.0 documents are, by definition, composed of characters from the entire range of Unicode code points, minus only a handful of disallowed control characters and the permanently unassigned code points D800-DFFF and FFFE-FFFF. These characters appear either directly as bytes according to the document's encoding, if the encoding supports them, or as numeric character references based on the character's Unicode code point, as long as the document's encoding supports the digits and symbols required to write the references (all encodings approved for use on the Internet do). For example, the references &#916; &#1049; &#1511; &#1605; &#3671; &#12353; &#21494; &#33865; &#45307; (or the same numeric values expressed in hexadecimal, with &#x as the prefix) display in a browser as Δ, Й, ק, م, ๗, ぁ, 叶, 葉 and 냻. With the proper fonts installed, these symbols look like the Greek capital letter "Delta", the Cyrillic capital letter "Short I", the Hebrew letter "Qof", the Arabic letter "Meem", the Thai numeral 7, the Japanese Hiragana "A", the simplified Chinese "Leaf", the traditional Chinese "Leaf", and the Korean Han-geul syllable "Nyrh", respectively.
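Numeric character references are simply the code point written in decimal or hexadecimal between "&#" (or "&#x") and ";". The following sketch shows both directions using Python's standard library; to_decimal_reference and to_hex_reference are hypothetical helper names.

```python
import html


def to_decimal_reference(character: str) -> str:
    """Write a character as a decimal numeric character reference."""
    return f"&#{ord(character)};"


def to_hex_reference(character: str) -> str:
    """Write a character as a hexadecimal numeric character reference."""
    return f"&#x{ord(character):X};"


print(to_decimal_reference("\u0394"))   # &#916;
print(to_hex_reference("\u0394"))       # &#x394;

# The "xmlcharrefreplace" error handler produces the same decimal form for
# any character that the target encoding cannot represent.
print("\u0394".encode("ascii", "xmlcharrefreplace"))   # b'&#916;'

# html.unescape resolves both forms back to the character.
print(html.unescape("&#916; &#x394;") == "\u0394 \u0394")   # True
```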

Unicode fonts

Free and retail fonts based on Unicode are common, since first TrueType and now OpenType use Unicode. These font formats map Unicode code points to glyphs.

There are thousands of fonts on the market, but probably fewer than half a dozen attempt to support the majority of Unicode's character repertoire. Instead, Unicode-based fonts typically focus on supporting only basic ASCII and particular scripts or sets of characters or symbols. There are several reasons for this: applications and documents rarely need to render characters from more than one or two writing systems; fonts tend to consume considerable resources in computing environments; and operating systems and applications are becoming increasingly intelligent about obtaining glyph information from separate font files as it is needed. Furthermore, it is a monumental task to design a consistent set of rendering instructions for tens of thousands of glyphs; such a venture passes the point of diminishing returns.

Unicode revision history

1991 Unicode 1.0
1993 Unicode 1.1
1996 Unicode 2.0
1998 Unicode 2.1
2000 Unicode 3.0
2001 Unicode 3.1
2002 Unicode 3.2
2003 Unicode 4.0

External links

Unicode Consortium
Unicode versions: 3.1, 3.2, 4.0
Alan Wood's Unicode Resources (contains lists of word processors with Unicode capability)
Unicode Code Charts (PDF)
UTF-8, UTF-16, UTF-32 Code Charts
The Letter Database
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Project UTF-8, evangelizing Unicode support in free software
Unicode TTF fonts: Code2000: license info and download link, Junicode: license info and download link, Titus Cyberbit Basic: license info & download link
ConScript Unicode Registry (a project to standardize part of the PUA for use with artificial scripts)