10 Unicode Manipulation

Functions operating on Unicode characters and UTF-8 strings.

10.1 Overview

This section describes a number of functions for dealing with Unicode characters and strings. There are analogues of the traditional ctype.h character classification and case conversion functions, UTF-8 analogues of some string utility functions, functions to perform normalization, case conversion and collation on UTF-8 strings and finally functions to convert between the UTF-8, UTF-16 and UCS-4 encodings of Unicode.

The implementations of the Unicode functions in GLib are based on the Unicode Character Data tables, which are available from www.unicode.org. GLib 2.8 supports Unicode 4.0, GLib 2.10 supports Unicode 4.1, and GLib 2.12 supports Unicode 5.0.

10.2 Usage

— Function: g-unichar-validate (ch unsigned-int32) ⇒  (ret bool)

Checks whether ch is a valid Unicode character. Some possible integer values of ch will not be valid. 0 is considered a valid character, though it's normally a string terminator.

ch
a Unicode character
ret
‘#t’ if ch is a valid Unicode character
— Function: g-unichar-isalnum (unsigned-int32) ⇒  (ret bool)

Determines whether a character is alphanumeric. Given some UTF-8 text, obtain a character value with g-utf8-get-char.

c
a Unicode character
ret
‘#t’ if c is an alphanumeric character
— Function: g-unichar-isalpha (unsigned-int32) ⇒  (ret bool)

Determines whether a character is alphabetic (i.e. a letter). Given some UTF-8 text, obtain a character value with g-utf8-get-char.

c
a Unicode character
ret
‘#t’ if c is an alphabetic character
— Function: g-unichar-iscntrl (unsigned-int32) ⇒  (ret bool)

Determines whether a character is a control character. Given some UTF-8 text, obtain a character value with g-utf8-get-char.

c
a Unicode character
ret
‘#t’ if c is a control character
— Function: g-unichar-isdigit (unsigned-int32) ⇒  (ret bool)

Determines whether a character is numeric (i.e. a digit). This covers ASCII 0-9 and also digits in other languages/scripts. Given some UTF-8 text, obtain a character value with g-utf8-get-char.

c
a Unicode character
ret
‘#t’ if c is a digit
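
As an illustration of what "digits in other scripts" means, here is a minimal sketch using Python's standard unicodedata module rather than this binding. It assumes the digit test corresponds to the Unicode general category Nd (decimal number); the helper name is_decimal_digit is hypothetical.

```python
import unicodedata

def is_decimal_digit(ch):
    """Return True if ch is in general category Nd (decimal number)."""
    return unicodedata.category(ch) == "Nd"

print(is_decimal_digit("7"))       # ASCII digit
print(is_decimal_digit("\u0663"))  # ARABIC-INDIC DIGIT THREE
print(is_decimal_digit("\u00b3"))  # SUPERSCRIPT THREE is category No, not Nd
```
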
— Function: g-unichar-isgraph (unsigned-int32) ⇒  (ret bool)

Determines whether a character is printable and not a space (returns ‘#f’ for control characters, format characters, and spaces). g-unichar-isprint is similar, but returns ‘#t’ for spaces. Given some UTF-8 text, obtain a character value with g-utf8-get-char.

c
a Unicode character
ret
‘#t’ if c is printable and is not a space
— Function: g-unichar-islower (unsigned-int32) ⇒  (ret bool)

Determines whether a character is a lowercase letter. Given some UTF-8 text, obtain a character value with g-utf8-get-char.

c
a Unicode character
ret
‘#t’ if c is a lowercase letter
— Function: g-unichar-isprint (unsigned-int32) ⇒  (ret bool)

Determines whether a character is printable. Unlike g-unichar-isgraph, returns ‘#t’ for spaces. Given some UTF-8 text, obtain a character value with g-utf8-get-char.

c
a Unicode character
ret
‘#t’ if c is printable
— Function: g-unichar-ispunct (unsigned-int32) ⇒  (ret bool)

Determines whether a character is punctuation or a symbol. Given some UTF-8 text, obtain a character value with g-utf8-get-char.

c
a Unicode character
ret
‘#t’ if c is a punctuation or symbol character
— Function: g-unichar-isspace (unsigned-int32) ⇒  (ret bool)

Determines whether a character is a space, tab, or line separator (newline, carriage return, etc.). Given some UTF-8 text, obtain a character value with g-utf8-get-char.

(Note: don't use this to do word breaking. The algorithm is fairly complex, so you have to use Pango or an equivalent to get word breaking right.)

c
a Unicode character
ret
‘#t’ if c is a space character
— Function: g-unichar-isupper (unsigned-int32) ⇒  (ret bool)

Determines if a character is uppercase.

c
a Unicode character
ret
‘#t’ if c is an uppercase character
— Function: g-unichar-isxdigit (unsigned-int32) ⇒  (ret bool)

Determines if a character is a hexadecimal digit.

c
a Unicode character.
ret
‘#t’ if the character is a hexadecimal digit
— Function: g-unichar-istitle (unsigned-int32) ⇒  (ret bool)

Determines if a character is titlecase. Some Unicode characters that are composites, such as the DZ digraph, have three case variants instead of just two. The titlecase form is used at the beginning of a word where only the first letter is capitalized. The titlecase form of the DZ digraph is U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.

c
a Unicode character
ret
‘#t’ if the character is titlecase
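
The three case variants of the DZ digraph can be inspected with Python's standard library. This is only a sketch of the underlying Unicode data, not a call into this binding; the constant names are hypothetical.

```python
import unicodedata

DZ_UPPER = "\u01f1"  # LATIN CAPITAL LETTER DZ
DZ_TITLE = "\u01f2"  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z
DZ_LOWER = "\u01f3"  # LATIN SMALL LETTER DZ

# Print code point, general category (Lu, Lt, Ll) and character name.
for ch in (DZ_UPPER, DZ_TITLE, DZ_LOWER):
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```

The general category Lt marks the titlecase form, which is distinct from both the uppercase (Lu) and lowercase (Ll) forms.
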
— Function: g-unichar-isdefined (unsigned-int32) ⇒  (ret bool)

Determines if a given character is assigned in the Unicode standard.

c
a Unicode character
ret
‘#t’ if the character has an assigned value
— Function: g-unichar-iswide (unsigned-int32) ⇒  (ret bool)

Determines if a character is typically rendered in a double-width cell.

c
a Unicode character
ret
‘#t’ if the character is wide
— Function: g-unichar-iswide-cjk (unsigned-int32) ⇒  (ret bool)

Determines if a character is typically rendered in a double-width cell under legacy East Asian locales. If a character is wide according to g-unichar-iswide, then it is also reported wide with this function, but the converse is not necessarily true. See Unicode Standard Annex #11 (East Asian Width) for details.

c
a Unicode character
ret
‘#t’ if the character is wide in legacy East Asian locales

Since 2.12
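
The relationship between the two width tests can be sketched with the East Asian Width property exposed by Python's unicodedata module. The sketch assumes that "wide" means the W or F classes and that the legacy CJK variant additionally counts the ambiguous class A; the helper names are hypothetical and this is not the binding's implementation.

```python
import unicodedata

def is_wide(ch):
    """Wide (W) or fullwidth (F) according to East Asian Width."""
    return unicodedata.east_asian_width(ch) in ("W", "F")

def is_wide_cjk(ch):
    """Like is_wide, but ambiguous-width (A) characters also count as wide."""
    return unicodedata.east_asian_width(ch) in ("W", "F", "A")

for ch in ("a", "\u4e2d", "\uff21", "\u03b1"):
    print(f"U+{ord(ch):04X}", is_wide(ch), is_wide_cjk(ch))
```

GREEK SMALL LETTER ALPHA (U+03B1) is an example of a character that is wide only under the legacy interpretation.
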

— Function: g-unichar-toupper (unsigned-int32) ⇒  (ret unsigned-int32)

Converts a character to uppercase.

c
a Unicode character
ret
the result of converting c to uppercase. If c is not a lowercase or titlecase character, or has no uppercase equivalent, c is returned unchanged.
— Function: g-unichar-tolower (unsigned-int32) ⇒  (ret unsigned-int32)

Converts a character to lower case.

c
a Unicode character.
ret
the result of converting c to lower case. If c is not an uppercase or titlecase character, or has no lowercase equivalent, c is returned unchanged.
— Function: g-unichar-totitle (unsigned-int32) ⇒  (ret unsigned-int32)

Converts a character to titlecase.

c
a Unicode character
ret
the result of converting c to titlecase. If c is not an uppercase or lowercase character, c is returned unchanged.
— Function: g-unichar-digit-value (unsigned-int32) ⇒  (ret int)

Determines the numeric value of a character as a decimal digit.

c
a Unicode character
ret
If c is a decimal digit (according to g-unichar-isdigit), its numeric value. Otherwise, -1.
— Function: g-unichar-xdigit-value (unsigned-int32) ⇒  (ret int)

Determines the numeric value of a character as a hexadecimal digit.

c
a Unicode character
ret
If c is a hex digit (according to g-unichar-isxdigit), its numeric value. Otherwise, -1.
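
The "value or -1" convention of the two digit-value functions can be sketched in Python. The helper names are hypothetical, and the sketch assumes that only ASCII letters a-f and A-F are handled as hexadecimal letters, so it may accept fewer characters than the binding does.

```python
import unicodedata

def digit_value(ch):
    """Decimal value of ch, or -1 if ch is not a decimal digit."""
    return unicodedata.decimal(ch, -1)

def xdigit_value(ch):
    """Hexadecimal value of ch, or -1 if ch is not a hex digit."""
    if "a" <= ch <= "f":
        return ord(ch) - ord("a") + 10
    if "A" <= ch <= "F":
        return ord(ch) - ord("A") + 10
    return digit_value(ch)

print(digit_value("7"), digit_value("\u0663"), digit_value("x"))
print(xdigit_value("f"), xdigit_value("A"), xdigit_value("g"))
```
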
— Function: g-unichar-type (unsigned-int32) ⇒  (ret <g-unicode-type>)

Classifies a Unicode character by type.

c
a Unicode character
ret
the type of the character.
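
Classification by type follows the Unicode general category of a character. As a rough analogue, Python's unicodedata.category returns the two-letter category abbreviation; how those abbreviations map onto the <g-unicode-type> values is not shown here and should be checked against the enumeration itself.

```python
import unicodedata

samples = ["A", "a", "1", " ", "\n", ",", "\u20ac"]

# Print each sample's code point next to its general category.
for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)}")
```
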
— Function: g-unichar-break-type (unsigned-int32) ⇒  (ret <g-unicode-break-type>)

Determines the break type of c. c should be a Unicode character (to derive a character from UTF-8 encoded text, use g-utf8-get-char). The break type is used to find word and line breaks ("text boundaries"). Pango implements the Unicode boundary resolution algorithms, and normally you would use a function such as pango-break instead of handling break types yourself.

c
a Unicode character
ret
the break type of c
— Function: g-unichar-get-mirror-char (ch unsigned-int32) ⇒  (ret bool) (mirrored_ch unsigned-int32)

In Unicode, some characters are mirrored. This means that their images are mirrored horizontally in text that is laid out from right to left. For instance, "(" would become its mirror image, ")", in right-to-left text.

If ch has the Unicode mirrored property, there is another Unicode character whose glyph is typically the mirror image of ch's glyph, and mirrored-ch is set, that character is stored in the location pointed to by mirrored-ch. Otherwise, the original character is stored there.

ch
a Unicode character
mirrored-ch
location to store the mirrored character
ret
‘#t’ if ch has a mirrored character, ‘#f’ otherwise

Since 2.4

— Function: g-utf8-get-char (mchars) ⇒  (ret unsigned-int32)

Converts a sequence of bytes encoded as UTF-8 to a Unicode character. If p does not point to a valid UTF-8 encoded character, results are undefined. If you are not sure that the bytes are complete valid Unicode characters, you should use g-utf8-get-char-validated instead.

p
a pointer to Unicode character encoded as UTF-8
ret
the resulting character
— Function: g-utf8-find-next-char (mchars) ⇒  (ret mchars)

Finds the start of the next UTF-8 character in the string after p.

p does not have to be at the beginning of a UTF-8 character. No check is made to see if the character found is actually valid other than it starts with an appropriate byte.

p
a pointer to a position within a UTF-8 encoded string
end
a pointer to the end of the string, or ‘#f’ to indicate that the string is nul-terminated, in which case the returned value will be
ret
a pointer to the found character or ‘#f’
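
The search described above only looks at lead bytes: in UTF-8, continuation bytes have the bit pattern 10xxxxxx, so the next character starts at the first following byte that does not match it. A minimal sketch over a Python bytes object, with byte offsets standing in for pointers and None for ‘#f’ (the helper name is hypothetical):

```python
def find_next_char(data: bytes, pos: int):
    """Offset of the next character start after pos, or None if there is none."""
    i = pos + 1
    while i < len(data):
        if data[i] & 0xC0 != 0x80:  # not a continuation byte
            return i
        i += 1
    return None

data = "a\u00e9\u20ac".encode("utf-8")  # 1-byte, 2-byte and 3-byte characters
print(find_next_char(data, 0))  # start of the 2-byte character
print(find_next_char(data, 2))  # pos is mid-character; still finds the next start
print(find_next_char(data, 3))  # no further character starts
```

As in the description, no validation is performed beyond checking the lead byte pattern.
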
— Function: g-utf8-strlen (mchars) ⇒  (ret long)

Returns the length of the string in characters.

p
pointer to the start of a UTF-8 encoded string.
max
the maximum number of bytes to examine. If max is less than 0, then the string is assumed to be nul-terminated. If max is 0, p will not be examined and may be ‘#f’.
ret
the length of the string in characters
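
The difference between length in characters and length in bytes can be sketched as follows. The helper utf8_strlen is hypothetical and assumes its input is already valid UTF-8; it counts every byte that is not a continuation byte.

```python
def utf8_strlen(data: bytes) -> int:
    """Count characters in valid UTF-8 by counting non-continuation bytes."""
    return sum(1 for b in data if b & 0xC0 != 0x80)

text = "na\u00efve \u20ac"
data = text.encode("utf-8")
print(len(data), utf8_strlen(data))  # byte count, then character count
```
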
— Function: g-utf8-strchr (mchars) (unsigned-int32) ⇒  (ret mchars)

Finds the leftmost occurrence of the given Unicode character in a UTF-8 encoded string, while limiting the search to len bytes. If len is -1, allow unbounded search.

p
a nul-terminated UTF-8 encoded string
len
the maximum length of p
c
a Unicode character
ret
‘#f’ if the string does not contain the character, otherwise, a pointer to the start of the leftmost occurrence of the character in the string.
— Function: g-utf8-strrchr (mchars) (unsigned-int32) ⇒  (ret mchars)

Find the rightmost occurrence of the given Unicode character in a UTF-8 encoded string, while limiting the search to len bytes. If len is -1, allow unbounded search.

p
a nul-terminated UTF-8 encoded string
len
the maximum length of p
c
a Unicode character
ret
‘#f’ if the string does not contain the character, otherwise, a pointer to the start of the rightmost occurrence of the character in the string.
— Function: g-utf8-strreverse (mchars) ⇒  (ret mchars)

Reverses a UTF-8 string. str must be valid UTF-8 encoded text. (Use g-utf8-validate on all text before trying to use UTF-8 utility functions with it.)

Note that unlike g-strreverse, this function returns newly-allocated memory, which should be freed with g-free when no longer needed.

str
a UTF-8 encoded string
len
the maximum length of str to use. If len < 0, then the string is nul-terminated.
ret
a newly-allocated string which is the reverse of str.

Since 2.2

— Function: g-utf8-validate (mchars) ⇒  (ret bool)

Validates UTF-8 encoded text. str is the text to validate; if str is nul-terminated, then max-len can be -1, otherwise max-len should be the number of bytes to validate. If end is non-‘#f’, then the end of the valid range will be stored there (i.e. the start of the first invalid character if some bytes were invalid, or the end of the text being validated otherwise).

Note that g-utf8-validate returns ‘#f’ if max-len is positive and NUL is met before max-len bytes have been read.

Returns ‘#t’ if all of str was valid. Many GLib and GTK+ routines require valid UTF-8 as input; so data read from a file or the network should be checked with g-utf8-validate before doing anything else with it.

str
a pointer to character data
max-len
max bytes to validate, or -1 to go until NUL
end
return location for end of valid data
ret
‘#t’ if the text was valid UTF-8
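
The "validate, and report where the valid range ends" contract can be sketched with Python's strict UTF-8 decoder. The helper name is hypothetical, and the sketch does not reproduce the special handling of embedded NUL bytes described above.

```python
def utf8_validate(data: bytes):
    """Return (is_valid, end), where end is the byte offset at which valid data stops."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return False, err.start  # offset of the first invalid byte
    return True, len(data)

print(utf8_validate("h\u00e9llo".encode("utf-8")))
print(utf8_validate(b"ab\xffcd"))
```

Input read from a file or the network would typically be passed through such a check before any other processing.
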
— Function: g-utf8-strup (mchars) ⇒  (ret mchars)

Converts all Unicode characters in the string that have a case to uppercase. The exact manner in which this is done depends on the current locale, and may result in the number of characters in the string increasing. (For instance, the German ess-zet will be changed to SS.)

str
a UTF-8 encoded string
len
length of str, in bytes, or -1 if str is nul-terminated.
ret
a newly allocated string, with all characters converted to uppercase.
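
The ess-zet case can be reproduced with Python's str.upper, which applies the Unicode full case mappings. Unlike the function described here, Python's conversion does not depend on the locale, so this only illustrates the change in length.

```python
word = "stra\u00dfe"     # contains LATIN SMALL LETTER SHARP S
upper = word.upper()

print(upper)
print(len(word), len(upper))  # the uppercase form is one character longer
```
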
— Function: g-utf8-strdown (mchars) ⇒  (ret mchars)

Converts all Unicode characters in the string that have a case to lowercase. The exact manner in which this is done depends on the current locale, and may result in the number of characters in the string changing.

str
a UTF-8 encoded string
len
length of str, in bytes, or -1 if str is nul-terminated.
ret
a newly allocated string, with all characters converted to lowercase.
— Function: g-utf8-casefold (mchars) ⇒  (ret mchars)

Converts a string into a form that is independent of case. The result will not correspond to any particular case, but can be compared for equality or ordered with the results of calling g-utf8-casefold on other strings.

Note that calling g-utf8-casefold followed by g-utf8-collate is only an approximation to the correct linguistic case insensitive ordering, though it is a fairly good one. Getting this exactly right would require a more sophisticated collation function that takes case sensitivity into account. GLib does not currently provide such a function.

str
a UTF-8 encoded string
len
length of str, in bytes, or -1 if str is nul-terminated.
ret
a newly allocated string, that is a case independent form of str.
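
Why case folding is preferable to lowercasing for caseless comparison can be shown with Python's str.casefold, which implements Unicode case folding. This is an analogue, not a call into this binding.

```python
a = "Stra\u00dfe"
b = "STRASSE"

print(a.lower() == b.lower())        # lowercasing keeps the sharp s, so these differ
print(a.casefold() == b.casefold())  # case folding maps both to the same form
```
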
— Function: g-utf8-normalize (mchars) (mode <g-normalize-mode>) ⇒  (ret mchars)

Converts a string into canonical form, standardizing such issues as whether a character with an accent is represented as a base character and combining accent or as a single precomposed character. You should generally call g-utf8-normalize before comparing two Unicode strings.

The normalization mode ‘G_NORMALIZE_DEFAULT’ only standardizes differences that do not affect the text content, such as the above-mentioned accent representation. ‘G_NORMALIZE_ALL’ also standardizes the "compatibility" characters in Unicode, such as SUPERSCRIPT THREE to the standard forms (in this case DIGIT THREE). Formatting information may be lost but for most text operations such characters should be considered the same. For example, g-utf8-collate normalizes with ‘G_NORMALIZE_ALL’ as its first step.

‘G_NORMALIZE_DEFAULT_COMPOSE’ and ‘G_NORMALIZE_ALL_COMPOSE’ are like ‘G_NORMALIZE_DEFAULT’ and ‘G_NORMALIZE_ALL’, but return a result with composed forms rather than a maximally decomposed form. This is often useful if you intend to convert the string to a legacy encoding or pass it to a system with less capable Unicode handling.

str
a UTF-8 encoded string.
len
length of str, in bytes, or -1 if str is nul-terminated.
mode
the type of normalization to perform.
ret
a newly allocated string, that is the normalized form of str.
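
The four modes correspond to the standard Unicode normalization forms: G_NORMALIZE_DEFAULT is NFD, G_NORMALIZE_DEFAULT_COMPOSE is NFC, G_NORMALIZE_ALL is NFKD and G_NORMALIZE_ALL_COMPOSE is NFKC. The effect of each form can be tried out with Python's unicodedata.normalize; the MODES table below is a hypothetical helper for this sketch only.

```python
import unicodedata

MODES = {
    "G_NORMALIZE_DEFAULT": "NFD",
    "G_NORMALIZE_DEFAULT_COMPOSE": "NFC",
    "G_NORMALIZE_ALL": "NFKD",
    "G_NORMALIZE_ALL_COMPOSE": "NFKC",
}

def normalize(text, mode):
    """Normalize text using the Unicode form that corresponds to mode."""
    return unicodedata.normalize(MODES[mode], text)

composed = "\u00e9"      # precomposed e with acute accent
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)
print(normalize(composed, "G_NORMALIZE_DEFAULT") == decomposed)
print(normalize("x\u00b3", "G_NORMALIZE_ALL"))  # SUPERSCRIPT THREE becomes DIGIT THREE
```
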
— Function: g-utf8-collate (str1 mchars) (str2 mchars) ⇒  (ret int)

Compares two strings for ordering using the linguistically correct rules for the current locale. When sorting a large number of strings, it will be significantly faster to obtain collation keys with g-utf8-collate-key and compare the keys with strcmp when sorting instead of sorting the original strings.

str1
a UTF-8 encoded string
str2
a UTF-8 encoded string
ret
< 0 if str1 compares before str2, 0 if they compare equal, > 0 if str1 compares after str2.
— Function: g-utf8-collate-key (mchars) ⇒  (ret mchars)

Converts a string into a collation key that can be compared with other collation keys produced by the same function using strcmp. The results of comparing the collation keys of two strings with strcmp will always be the same as comparing the two original strings with g-utf8-collate.

str
a UTF-8 encoded string.
len
length of str, in bytes, or -1 if str is nul-terminated.
ret
a newly allocated string. This string should be freed with g-free when you are done with it.
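
The same "compute a key once, then compare keys cheaply" pattern exists in the C library as strxfrm and strcoll, which Python exposes in its locale module. The sketch below uses those functions as an analogue. It runs in whatever locale the process currently has, and a real program would usually call locale.setlocale first.

```python
import locale
from functools import cmp_to_key

words = ["pear", "Apple", "banana", "apple"]

# Sorting with precomputed keys: each string is transformed once.
by_key = sorted(words, key=locale.strxfrm)

# Sorting with pairwise comparison: the comparison runs for every pair examined.
by_compare = sorted(words, key=cmp_to_key(locale.strcoll))

print(by_key)
print(by_key == by_compare)
```
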
— Function: g-utf8-collate-key-for-filename (mchars) ⇒  (ret mchars)

Converts a string into a collation key that can be compared with other collation keys produced by the same function using strcmp.

In order to sort filenames correctly, this function treats the dot '.' as a special case. Most dictionary orderings seem to consider it insignificant, thus producing the ordering "event.c" "eventgenerator.c" "event.h" instead of "event.c" "event.h" "eventgenerator.c". Also, we would like to treat numbers intelligently so that "file1" "file10" "file5" is sorted as "file1" "file5" "file10".

str
a UTF-8 encoded string.
len
length of str, in bytes, or -1 if str is nul-terminated.
ret
a newly allocated string. This string should be freed with g-free when you are done with it.

Since 2.8
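
The numeric part of this ordering can be sketched by splitting a name into text and digit runs and comparing the digit runs as integers. This is a simplified illustration, not the algorithm used by the function: filename_key is a hypothetical helper, text runs are compared by code point rather than by locale rules, and the dot needs no special handling here only because it sorts before letters in code point order.

```python
import re

def filename_key(name):
    """Split name into alternating text and digit runs; digit runs become ints."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capturing group puts the digit runs at odd indices.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

print(sorted(["file10", "file1", "file5"], key=filename_key))
print(sorted(["eventgenerator.c", "event.h", "event.c"], key=filename_key))
```
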

— Function: g-unichar-to-utf8 (unsigned-int32) ⇒  (ret mchars)

Converts a single character to UTF-8.

c
a Unicode character code
outbuf
output buffer, must have at least 6 bytes of space. If ‘#f’, the length will be computed and returned and nothing will be written to outbuf.
ret
number of bytes written
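
The number of bytes produced depends on the code point. The 6-byte buffer covers the longest sequences allowed by the original UTF-8 design, while code points in the current Unicode range (up to U+10FFFF) need at most 4 bytes. A short sketch using Python's built-in encoder, with the helper name utf8_length chosen for illustration:

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes needed to encode code_point as UTF-8."""
    return len(chr(code_point).encode("utf-8"))

for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X} -> {utf8_length(cp)} byte(s)")
```
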