

2 Character Set Conversion

Convert strings between different character sets using iconv().

2.1 Overview

2.2 File Name Encodings

Historically, Unix has not had a defined encoding for file names: a file name is valid as long as it does not have path separators in it ("/"). However, displaying file names may require conversion: from the character set in which they were created, to the character set in which the application operates. Consider the Spanish file name "Presentación.sxi". If the application which created it uses ISO-8859-1 for its encoding, then the actual file name on disk would look like this:

     
     Character:  P  r  e  s  e  n  t  a  c  i  ó  n  .  s  x  i
     Hex code:   50 72 65 73 65 6e 74 61 63 69 f3 6e 2e 73 78 69
     

However, if the application uses UTF-8, the actual file name on disk would look like this:

     
     Character:  P  r  e  s  e  n  t  a  c  i  ó     n  .  s  x  i
     Hex code:   50 72 65 73 65 6e 74 61 63 69 c3 b3 6e 2e 73 78 69
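
The two byte sequences above can be reproduced with a short sketch. Python is used here purely for illustration; it is not part of GLib or its Guile bindings, and the variable names are hypothetical.

```python
# Encode the same file name in the two character sets discussed above.
name = "Presentaci\u00f3n.sxi"  # "Presentación.sxi"

latin1_bytes = name.encode("iso-8859-1")
utf8_bytes = name.encode("utf-8")

# The accented letter is one byte (f3) in ISO-8859-1
# but two bytes (c3 b3) in UTF-8, so the names differ on disk.
print(latin1_bytes.hex(" "))
print(utf8_bytes.hex(" "))
```

The UTF-8 form is one byte longer, which is why the two file names are distinct as far as the file system is concerned.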
     

GLib uses UTF-8 for its strings, and GUI toolkits like GTK+ that use GLib do the same. If you get a file name from the file system, for example from readdir(3) or from g-dir-read-name, and you wish to display the file name to the user, you will need to convert it into UTF-8. The opposite case is when the user types the name of a file they wish to save: the toolkit will give you that string in UTF-8 encoding, and you will need to convert it to the character set used for file names before you can create the file with open(2) or fopen(3).

By default, GLib assumes that file names on disk are in UTF-8 encoding. This is a valid assumption for file systems which were created relatively recently: most applications use UTF-8 encoding for their strings, and that is also what they use for the file names they create. However, older file systems may still contain file names created in "older" encodings, such as ISO-8859-1. In this case, for compatibility reasons, you may want to instruct GLib to use that particular encoding for file names rather than UTF-8. You can do this by specifying the encoding for file names in the G_FILENAME_ENCODING environment variable. For example, if your installation uses ISO-8859-1 for file names, you can put this in your ~/.profile:

     
     export G_FILENAME_ENCODING=ISO-8859-1
     

GLib provides the functions g-filename-to-utf8 and g-filename-from-utf8 to perform the necessary conversions. These functions convert file names from the encoding specified in G_FILENAME_ENCODING to UTF-8 and vice versa. The figure file-name-encodings-diagram, which is missing from this version of the manual, illustrates how these functions are used to convert between UTF-8 and the encoding for file names in the file system.
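
As a rough model of that round trip, the following Python sketch mimics the two conversions. It is not GLib's implementation: the helper names are hypothetical, a Python `str` stands in for a UTF-8 string, and G_FILENAME_ENCODING is assumed to hold a single character set name.

```python
import os

def filename_encoding() -> str:
    # Simplifying assumption: the variable names exactly one character set;
    # when it is unset or empty, fall back to UTF-8.
    return os.environ.get("G_FILENAME_ENCODING") or "UTF-8"

def filename_to_utf8(raw: bytes) -> str:
    # On-disk bytes -> text suitable for display.
    return raw.decode(filename_encoding())

def filename_from_utf8(text: str) -> bytes:
    # Text typed by the user -> bytes suitable for the file system.
    return text.encode(filename_encoding())

os.environ["G_FILENAME_ENCODING"] = "ISO-8859-1"
raw = b"Presentaci\xf3n.sxi"        # bytes as stored on disk
shown = filename_to_utf8(raw)       # readable text for the UI
round_trip = filename_from_utf8(shown)
```

With the variable set consistently, converting to text and back yields the original on-disk bytes.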

2.2.1 Checklist for Application Writers

This section is a practical summary of the detailed description above. You can use this as a checklist of things to do to make sure your applications process file name encodings correctly.

If you get a file name from the file system from a function such as readdir(3) or gtk-file-chooser-get-filename, you do not need to do any conversion to pass that file name to functions like open(2), rename(2), or fopen(3) — those are "raw" file names which the file system understands.

If you need to display a file name, convert it to UTF-8 first by using g-filename-to-utf8. If conversion fails, display a string like "Unknown file name". Do not convert this string back into the encoding used for file names if you wish to pass it to the file system; use the original file name instead. For example, the document window of a word processor could display "Unknown file name" in its title bar but still let the user save the file, as it would keep the raw file name internally. This can happen if the user has not set the G_FILENAME_ENCODING environment variable even though they have files whose names are not encoded in UTF-8.

If your user interface lets the user type a file name for saving or renaming, convert it to the encoding used for file names in the file system by using g-filename-from-utf8. Pass the converted file name to functions like fopen(3). If conversion fails, ask the user to enter a different file name. This can happen if the user types Japanese characters when G_FILENAME_ENCODING is set to ‘ISO-8859-1’, for example.
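
The checklist can be sketched as follows. This is an illustration in Python under stated assumptions, not the GLib API: `display_name` and `name_for_saving` are hypothetical helpers, and the file-name encoding is passed in explicitly instead of being read from the environment.

```python
from typing import Optional

def display_name(raw: bytes, fs_encoding: str) -> str:
    # For display only; the caller keeps `raw` for file system calls.
    try:
        return raw.decode(fs_encoding)
    except UnicodeDecodeError:
        return "Unknown file name"

def name_for_saving(typed: str, fs_encoding: str) -> Optional[bytes]:
    # Returns None when the typed name cannot be represented,
    # in which case the user should be asked for a different name.
    try:
        return typed.encode(fs_encoding)
    except UnicodeEncodeError:
        return None

raw = b"Presentaci\xf3n.sxi"           # ISO-8859-1 bytes on disk
title = display_name(raw, "utf-8")     # encoding left at the UTF-8 default
# `title` is only a placeholder; `raw` is still what is passed to open(2).

japanese = "\u65e5\u672c\u8a9e.txt"    # Japanese characters typed by the user
rejected = name_for_saving(japanese, "iso-8859-1")
```

Keeping the raw bytes alongside the display string is what lets the application keep working on a file whose name it cannot render.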

2.3 Usage

— Function: g-convert (str mchars) (len ssize_t) (to_codeset mchars) (from_codeset mchars) ⇒  (ret mchars) (bytes_read size_t) (bytes_written size_t)

Converts a string from one character set to another.

str
the string to convert
to-codeset
name of character set into which to convert str
from-codeset
character set of str.
ret
If the conversion was successful, a string. Otherwise an exception will be thrown.

Note that some encodings may allow nul bytes to occur inside strings. In that case, the Guile wrapper for this function is unsafe.
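
To make the conversion and the nul-byte caveat concrete, here is a minimal Python analogue. The `convert` helper is hypothetical and only mirrors the argument order of g-convert; it says nothing about how the Guile wrapper is implemented.

```python
def convert(data: bytes, to_codeset: str, from_codeset: str) -> bytes:
    # Re-encode `data`; raises UnicodeError if the data cannot be converted.
    return data.decode(from_codeset).encode(to_codeset)

utf8 = convert(b"Presentaci\xf3n.sxi", "utf-8", "iso-8859-1")

# UTF-16 is an encoding in which nul bytes occur inside ordinary text,
# so code that treats the result as nul-terminated would stop after "a".
utf16 = convert(b"abc", "utf-16-le", "utf-8")
has_embedded_nul = b"\x00" in utf16
```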

— Function: g-convert-with-fallback (str mchars) (len ssize_t) (to_codeset mchars) (from_codeset mchars) (fallback mchars) ⇒  (ret mchars) (bytes_read size_t) (bytes_written size_t)

Converts a string from one character set to another, possibly including fallback sequences for characters not representable in the output. Note that it is not guaranteed that the specification for the fallback sequences in fallback will be honored. Some systems may do an approximate conversion from from-codeset to to-codeset in their iconv functions, in which case GLib will simply return that approximate conversion.

str
the string to convert
to-codeset
name of character set into which to convert str
from-codeset
character set of str.
fallback
UTF-8 string to use in place of characters not present in the target encoding. (The string must be representable in the target encoding.) If ‘#f’, characters not in the target encoding will be represented as Unicode escapes \uxxxx or \Uxxxxyyyy.
ret
If the conversion was successful, a string. Otherwise an exception will be thrown.
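
The fallback behaviour can be approximated in Python as shown below. This is a sketch, not GLib's algorithm: `convert_with_fallback` is a hypothetical helper, it starts from an already decoded string, and the per-character loop is only appropriate for stateless target encodings such as ASCII or ISO-8859-1.

```python
from typing import Optional

def convert_with_fallback(text: str, to_codeset: str,
                          fallback: Optional[str] = None) -> bytes:
    if fallback is None:
        # Python's built-in handler writes \uxxxx or \Uxxxxxxxx escapes.
        return text.encode(to_codeset, errors="backslashreplace")
    # The fallback itself must be representable; this raises otherwise.
    replacement = fallback.encode(to_codeset)
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(to_codeset)
        except UnicodeEncodeError:
            out += replacement
    return bytes(out)

escaped = convert_with_fallback("a\u65e5b", "iso-8859-1")
replaced = convert_with_fallback("a\u65e5b", "iso-8859-1", fallback="?")
```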
— Function: g-locale-to-utf8 (opsysstring mchars) (len ssize_t) ⇒  (ret mchars) (bytes_read size_t) (bytes_written size_t)

Converts a string which is in the encoding used for strings by the C runtime (usually the same as that used by the operating system) in the current locale into a UTF-8 string.

opsysstring
a string in the encoding of the current locale. On Windows this means the system codepage.
ret
The converted string. If the conversion fails, an exception will be raised.
— Function: g-filename-to-utf8 (opsysstring mchars) (len ssize_t) ⇒  (ret mchars) (bytes_read size_t) (bytes_written size_t)

Converts a string which is in the encoding used by GLib for filenames into a UTF-8 string. Note that on Windows GLib uses UTF-8 for filenames.

opsysstring
a string in the encoding for filenames
ret
The converted string. If the conversion fails, an exception will be raised.
— Function: g-filename-from-utf8 (utf8string mchars) (len ssize_t) ⇒  (ret mchars) (bytes_read size_t) (bytes_written size_t)

Converts a string from UTF-8 to the encoding GLib uses for filenames. Note that on Windows GLib uses UTF-8 for filenames.

utf8string
a UTF-8 encoded string.
len
the length of the string, or -1 if the string is nul-terminated.
bytes-read
location to store the number of bytes in the input string that were successfully converted, or ‘#f’. Even if the conversion was successful, this may be less than len if there were partial characters at the end of the input. If the error <g-convert-error-illegal-sequence> occurs, the value stored will be the byte offset after the last valid input sequence.
bytes-written
the number of bytes stored in the output buffer (not including the terminating nul).
error
location to store the error occurring, or ‘#f’ to ignore errors. Any of the errors in <g-convert-error> may occur.
ret
The converted string, or ‘#f’ on an error.
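
The meaning of bytes-read for input that ends in a partial character, or that contains an illegal sequence, can be illustrated with Python's decoder. The `decode_prefix` helper is hypothetical and models only the offset bookkeeping, not the conversion to the file name encoding.

```python
from typing import Tuple

def decode_prefix(data: bytes) -> Tuple[str, int]:
    # Returns (text, bytes_read) for the longest valid UTF-8 prefix.
    try:
        return data.decode("utf-8"), len(data)
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset just after the last valid sequence.
        return data[:exc.start].decode("utf-8"), exc.start

# The input ends with c3, the first half of a two-byte character.
text, bytes_read = decode_prefix(b"Presentaci\xc3")

# ff can never appear in UTF-8, so conversion stops in the middle.
bad_text, bad_offset = decode_prefix(b"ab\xffcd")
```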
— Function: g-filename-from-uri (uri mchars) ⇒  (ret mchars) (hostname mchars)

Converts an escaped ASCII-encoded URI to a local filename in the encoding used for filenames.

uri
a uri describing a filename (escaped, encoded in ASCII).
hostname
Location to store hostname for the URI, or ‘#f’. If there is no hostname in the URI, ‘#f’ will be stored in this location.
error
location to store the error occurring, or ‘#f’ to ignore errors. Any of the errors in <g-convert-error> may occur.
ret
a newly-allocated string holding the resulting filename, or ‘#f’ on an error.
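
A simplified model of this conversion, using Python's standard library, is shown below. The `filename_from_uri` helper is hypothetical, and GLib's own validation of the URI is stricter than this sketch.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit, unquote_to_bytes

def filename_from_uri(uri: str) -> Tuple[bytes, Optional[str]]:
    parts = urlsplit(uri)
    if parts.scheme != "file":
        raise ValueError("not a file URI: " + uri)
    hostname = parts.netloc or None
    # Percent-escapes become raw bytes in the file name encoding.
    return unquote_to_bytes(parts.path), hostname

local_name, local_host = filename_from_uri("file:///tmp/Presentaci%F3n.sxi")
remote_name, remote_host = filename_from_uri("file://example.org/etc/motd")
```

The result is a byte string because the escaped bytes need not form valid UTF-8.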
— Function: g-filename-to-uri (filename mchars) (hostname mchars) ⇒  (ret mchars)

Converts an absolute filename to an escaped ASCII-encoded URI, with the path component following Section 3.3. of RFC 2396.

filename
an absolute filename specified in the GLib file name encoding, which is the on-disk file name bytes on Unix, and UTF-8 on Windows
hostname
A UTF-8 encoded hostname, or ‘#f’ for none.
error
location to store the error occurring, or ‘#f’ to ignore errors. Any of the errors in <g-convert-error> may occur.
ret
a newly-allocated string holding the resulting URI, or ‘#f’ on an error.
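
The reverse direction can be sketched in the same way. The `filename_to_uri` helper is hypothetical; the set of characters that Python's `quote` leaves unescaped may differ from the set GLib uses, and non-ASCII hostnames are not handled here.

```python
from urllib.parse import quote

def filename_to_uri(filename: bytes, hostname: str = "") -> str:
    if not filename.startswith(b"/"):
        raise ValueError("filename must be absolute")
    # Bytes outside the unreserved ASCII set become %XX escapes;
    # the path separator "/" is kept as-is.
    return "file://" + hostname + quote(filename)

local_uri = filename_to_uri(b"/tmp/Presentaci\xf3n.sxi")
remote_uri = filename_to_uri(b"/a b", "example.org")
```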
— Function: g-filename-display-name (filename mchars) ⇒  (ret mchars)

Converts a filename into a valid UTF-8 string. The conversion is not necessarily reversible, so you should keep the original around and use the return value of this function only for display purposes. Unlike g-filename-to-utf8, the result is guaranteed to be non-‘#f’ even if the filename actually isn't in the GLib file name encoding.

If GLib cannot make sense of the encoding of filename, as a last resort it replaces unknown characters with U+FFFD, the Unicode replacement character. You can search the result for the UTF-8 encoding of this character (which is "\357\277\275" in octal notation) to find out if filename was in an invalid encoding.

If you know the whole pathname of the file you should use g-filename-display-basename, since that allows location-based translation of filenames.

filename
a pathname hopefully in the GLib file name encoding
ret
a newly allocated string containing a rendition of the filename in valid UTF-8

Since 2.6
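
The last-resort behaviour and the check for the replacement character can be illustrated as follows. This Python sketch covers only the final fallback step; it assumes UTF-8 as the file name encoding, and `display_name` is a hypothetical helper.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the Unicode replacement character

def display_name(raw: bytes) -> str:
    # Never fails: undecodable bytes become U+FFFD.
    return raw.decode("utf-8", errors="replace")

shown = display_name(b"Presentaci\xf3n.sxi")   # ISO-8859-1 bytes, not UTF-8
was_invalid = REPLACEMENT in shown

# The UTF-8 encoding of U+FFFD, written with octal escapes.
replacement_bytes = REPLACEMENT.encode("utf-8")
```

Because the replacement discards the original byte, the result cannot be converted back to the on-disk name.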

— Function: g-filename-display-basename (filename mchars) ⇒  (ret mchars)

Returns the display basename for the particular filename, guaranteed to be valid UTF-8. The display name might not be identical to the filename: for instance, there might be problems converting it to UTF-8, and some files can be translated in the display.

If GLib cannot make sense of the encoding of filename, as a last resort it replaces unknown characters with U+FFFD, the Unicode replacement character. You can search the result for the UTF-8 encoding of this character (which is "\357\277\275" in octal notation) to find out if filename was in an invalid encoding.

You must pass the whole absolute pathname to this function so that translation of well-known locations can be done.

This function is preferred over g-filename-display-name if you know the whole path, as it allows translation.

filename
an absolute pathname in the GLib file name encoding
ret
a newly allocated string containing a rendition of the basename of the filename in valid UTF-8

Since 2.6

— Function: g-locale-from-utf8 (utf8string mchars) (len ssize_t) ⇒  (ret mchars) (bytes_read size_t) (bytes_written size_t)

Converts a string from UTF-8 to the encoding used for strings by the C runtime (usually the same as that used by the operating system) in the current locale.

utf8string
a UTF-8 encoded string
ret
The converted string. If the conversion fails, an exception will be raised.