10 Grapheme cluster breaks in strings <unigbrk.h>

This include file declares functions for determining where in a string “grapheme clusters” start and end. A “grapheme cluster” is an approximation to a user-perceived character, which sometimes corresponds to multiple Unicode characters. Editing operations such as mouse selection, cursor movement, and backspacing often operate on grapheme clusters as units, not on individual characters.

Some grapheme clusters are built from a base character and a combining character. The letter ‘é’, for example, is most commonly represented in Unicode as the single character U+00E9 LATIN SMALL LETTER E WITH ACUTE. It is, however, equally valid to use the pair of characters U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT. Since the user perceives this pair as a single character, the two characters are grouped into a single grapheme cluster.
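
The difference between counting characters and counting grapheme clusters can be illustrated with the two UTF-8 encodings of ‘é’. The following is a minimal sketch, not the algorithm used by this library: the helper names (decode_utf8, is_combining_diacritic, count_code_points, count_clusters_simplified) are hypothetical, the input is assumed to be well-formed UTF-8, and only characters in the block U+0300..U+036F are treated as extending the preceding cluster. The real rules cover many more cases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, store the
   code point in *cp, and return the sequence length in bytes.
   Assumes well-formed input; no validation is performed.  */
static size_t
decode_utf8 (const uint8_t *s, uint32_t *cp)
{
  if (s[0] < 0x80)
    {
      *cp = s[0];
      return 1;
    }
  if ((s[0] & 0xE0) == 0xC0)
    {
      *cp = ((uint32_t) (s[0] & 0x1F) << 6) | (uint32_t) (s[1] & 0x3F);
      return 2;
    }
  if ((s[0] & 0xF0) == 0xE0)
    {
      *cp = ((uint32_t) (s[0] & 0x0F) << 12)
            | ((uint32_t) (s[1] & 0x3F) << 6)
            | (uint32_t) (s[2] & 0x3F);
      return 3;
    }
  *cp = ((uint32_t) (s[0] & 0x07) << 18)
        | ((uint32_t) (s[1] & 0x3F) << 12)
        | ((uint32_t) (s[2] & 0x3F) << 6)
        | (uint32_t) (s[3] & 0x3F);
  return 4;
}

/* Simplifying assumption: only the Combining Diacritical Marks block
   (U+0300..U+036F) extends a cluster.  */
static int
is_combining_diacritic (uint32_t cp)
{
  return cp >= 0x0300 && cp <= 0x036F;
}

/* Count the Unicode characters (code points) in the n bytes at s.  */
size_t
count_code_points (const uint8_t *s, size_t n)
{
  size_t i = 0, count = 0;
  uint32_t cp;

  assert (s != NULL);
  while (i < n)
    {
      i += decode_utf8 (s + i, &cp);
      count++;
    }
  return count;
}

/* Count clusters under the simplified rule: a combining diacritic joins
   the preceding cluster; every other character starts a new one.  A
   combining diacritic at the very start forms a cluster of its own.  */
size_t
count_clusters_simplified (const uint8_t *s, size_t n)
{
  size_t i = 0, count = 0;
  uint32_t cp;

  assert (s != NULL);
  while (i < n)
    {
      int at_start = (i == 0);
      i += decode_utf8 (s + i, &cp);
      if (at_start || !is_combining_diacritic (cp))
        count++;
    }
  return count;
}
```

With the precomposed form (bytes C3 A9) both functions return 1. With the decomposed form (bytes 65 CC 81) count_code_points returns 2 while count_clusters_simplified returns 1, which matches the user's perception of a single character.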

But there are also grapheme clusters that consist of several base characters. For example, a Devanagari letter and a Devanagari vowel sign that follows it may form a grapheme cluster. Similarly, some pairs of Thai characters form grapheme clusters, as do Hangul syllables that are composed of two or three Hangul characters.