Next: , Up: Introduction

1.1 Unicode

Unicode is a standardized repertoire of characters that contains characters from all scripts of the world, from Latin letters to Chinese ideographs and Babylonian cuneiform glyphs. It also specifies how these characters are to be rendered on a screen or on paper, and how common text processing (word selection, line breaking, uppercasing of page titles etc.) is supposed to behave on Unicode text.

Unicode also specifies three ways of storing sequences of Unicode characters in a computer whose basic unit of data is an 8-bit byte:

Every character is represented as 1 to 4 bytes.
Every character is represented as 1 to 2 units of 16 bits.
UTF-32, a.k.a. UCS-4
Every character is represented as 1 unit of 32 bits.

For encoding Unicode text in a file, UTF-8 is usually used. For encoding Unicode strings in memory for a program, either of the three encoding forms can be reasonably used.

Unicode is widely used on the web. Prior to the use of Unicode, web pages were in many different encodings (ISO-8859-1 for English, French, Spanish, ISO-8859-2 for Polish, ISO-8859-7 for Greek, KOI8-R for Russian, GB2312 or BIG5 for Chinese, ISO-2022-JP-2 or EUC-JP or Shift_JIS for Japanese, and many many others). It was next to impossible to create a document that contained Chinese and Polish text in the same document. Due to the many encodings for Japanese, even the processing of pure Japanese text was error prone.