4.2 Character sets and encodings in tags

This section describes some intricacies of character sets and encodings in tags.

Text in ID3v2 tags can be encoded in a variety of ways. ID3v2.3 and earlier standards support only text encoded in ISO 8859-1 and UCS-2. ID3v2.4 added support for UTF-8 and UTF-16BE [1], and replaced UCS-2 with UTF-16.
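In the tag data itself, each text frame begins with a single byte that selects the encoding. The following C sketch maps that byte to a character set name for a given ID3v2 major version. The enum and the function id3v2_charset_name are hypothetical names chosen for illustration; they are not part of id3lib, TagLib or GMediaServer.

```c
#include <stddef.h>

/* Encoding byte that prefixes a text frame in an ID3v2 tag.
 * Names are hypothetical; the numeric values follow the ID3v2 standards. */
enum id3v2_text_encoding {
    ID3V2_ENC_ISO_8859_1 = 0x00, /* all ID3v2 versions */
    ID3V2_ENC_UTF_16     = 0x01, /* with BOM; UCS-2 in ID3v2.3 and earlier */
    ID3V2_ENC_UTF_16BE   = 0x02, /* no BOM; ID3v2.4 only */
    ID3V2_ENC_UTF_8      = 0x03  /* ID3v2.4 only */
};

/* Return the character set name for an encoding byte, or NULL if the
 * byte is not valid for the given major version (3 = ID3v2.3, 4 = ID3v2.4). */
const char *id3v2_charset_name(unsigned char enc, int major_version)
{
    switch (enc) {
    case ID3V2_ENC_ISO_8859_1:
        return "ISO-8859-1";
    case ID3V2_ENC_UTF_16:
        return major_version >= 4 ? "UTF-16" : "UCS-2";
    case ID3V2_ENC_UTF_16BE:
        return major_version >= 4 ? "UTF-16BE" : NULL;
    case ID3V2_ENC_UTF_8:
        return major_version >= 4 ? "UTF-8" : NULL;
    default:
        return NULL;
    }
}
```

Returning NULL for bytes 0x02 and 0x03 in an ID3v2.3 tag lets the caller decide how to treat a non-conforming frame instead of silently guessing an encoding.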

If you are using id3lib, only the ISO 8859-1 and UCS-2/UTF-16 encodings are supported for ID3v2.4 tags. The current C API of id3lib would have to be extended to support UTF-8 and UTF-16BE in ID3v2.4 tags. (In particular, an ID3Field_GetEncoding function is missing.)

TagLib seems to support all encodings used in ID3v2.4 tags.

Unfortunately, many applications still put UTF-8 encoded text in ID3v2.3 and earlier tags. This is incorrect according to the standard [2]: single-byte text should be encoded in ISO 8859-1 and nothing else. TagLib handles all single-byte text in ID3v1 and ID3v2.3 tags as ISO 8859-1, while id3lib gives you the option to treat the data as you like. At the moment, GMediaServer assumes single-byte text is encoded in ISO 8859-1 when using id3lib.
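Once single-byte text is assumed to be ISO 8859-1, converting it to UTF-8 is mechanical, because every ISO 8859-1 byte equals the Unicode code point of the same value. The following is a minimal sketch of such a conversion; latin1_to_utf8 is a hypothetical helper written for illustration, not the code GMediaServer actually uses.

```c
#include <stddef.h>

/* Convert len bytes of ISO 8859-1 text to a NUL-terminated UTF-8 string.
 * dst must have room for at least 2 * len + 1 bytes.
 * Returns the number of bytes written, not counting the terminating NUL.
 * Hypothetical helper for illustration only. */
size_t latin1_to_utf8(const unsigned char *src, size_t len, char *dst)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++) {
        unsigned char c = src[i];

        if (c < 0x80) {
            /* ASCII range: one byte, unchanged */
            dst[n++] = (char) c;
        } else {
            /* 0x80..0xFF: two-byte UTF-8 sequence */
            dst[n++] = (char) (0xC0 | (c >> 6));
            dst[n++] = (char) (0x80 | (c & 0x3F));
        }
    }
    dst[n] = '\0';
    return n;
}
```

Note that if a tag actually contains UTF-8 bytes mislabeled as single-byte text, a conversion like this encodes each of those bytes a second time, which is why such tags show up with garbled non-ASCII characters.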


Footnotes

[1] The difference between UTF-16, UTF-16BE and UTF-16LE is that strings encoded with UTF-16 must start with a byte order mark (BOM), whereas UTF-16BE and UTF-16LE have a fixed byte order and carry no BOM. For UTF-16 the BOM is either 0xFF 0xFE (denoting little endian) or 0xFE 0xFF (denoting big endian).

[2] ID3v2.3.0 Informal standard.