

10.2 Grapheme cluster break property

This is a lower-level API. The grapheme cluster break property is defined in Unicode Standard Annex #29, section “Grapheme Cluster Boundaries” (see https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries). It is used to determine the grapheme cluster breaks in a string.

The following are the possible values of the grapheme cluster break property. More values may be added in the future.

Constant: int GBP_OTHER
Constant: int GBP_CR
Constant: int GBP_LF
Constant: int GBP_CONTROL
Constant: int GBP_EXTEND
Constant: int GBP_PREPEND
Constant: int GBP_SPACINGMARK
Constant: int GBP_L
Constant: int GBP_V
Constant: int GBP_T
Constant: int GBP_LV
Constant: int GBP_LVT
Constant: int GBP_RI
Constant: int GBP_ZWJ
Constant: int GBP_EB
Constant: int GBP_EM
Constant: int GBP_GAZ
Constant: int GBP_EBG

The following function looks up the grapheme cluster break property of a character.

Function: int uc_graphemeclusterbreak_property (ucs4_t uc)

Returns the Grapheme_Cluster_Break property of a Unicode character.

The following function determines whether there is a grapheme cluster break between two Unicode characters. It is the primitive upon which the higher-level functions in the previous section are directly based.

Function: bool uc_is_grapheme_break (ucs4_t a, ucs4_t b)

Returns true if there is a grapheme cluster boundary between Unicode characters a and b.

There is always a grapheme cluster break at the start or end of text. You can specify zero for a or b to indicate start of text or end of text, respectively.

This implements the extended (not legacy) grapheme cluster rules described in the Unicode standard, because the standard says that they are preferred.
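To make the idea of a two-character test concrete, the following is a minimal sketch of how a pairwise decision can be derived from property values alone. It is not the library's implementation: the enum sk_prop and the functions sk_is_control and sk_is_break are hypothetical names invented for this illustration, and only the rules of UAX #29 that need no more than two characters (GB1 to GB9b and GB999) are covered.

```c
#include <stdbool.h>

/* Hypothetical property values for this sketch only. They are NOT the
   library's GBP_* constants and cover only part of UAX #29. */
enum sk_prop {
  SK_TEXT_EDGE,   /* stands in for "start of text" or "end of text" */
  SK_OTHER, SK_CR, SK_LF, SK_CONTROL, SK_EXTEND, SK_ZWJ,
  SK_SPACINGMARK, SK_PREPEND, SK_L, SK_V, SK_T, SK_LV, SK_LVT
};

/* CR, LF and Control all force breaks around themselves (except CR LF). */
static bool sk_is_control(enum sk_prop p)
{
  return p == SK_CR || p == SK_LF || p == SK_CONTROL;
}

/* Returns true if the simplified rules place a break between a and b.
   The rules are tried in the order given by UAX #29; the first match wins. */
bool sk_is_break(enum sk_prop a, enum sk_prop b)
{
  /* GB1, GB2: always break at the start and at the end of text. */
  if (a == SK_TEXT_EDGE || b == SK_TEXT_EDGE)
    return true;
  /* GB3: never break inside a CR LF pair. */
  if (a == SK_CR && b == SK_LF)
    return false;
  /* GB4, GB5: otherwise break after and before controls. */
  if (sk_is_control(a) || sk_is_control(b))
    return true;
  /* GB6, GB7, GB8: keep Hangul syllable sequences together. */
  if (a == SK_L && (b == SK_L || b == SK_V || b == SK_LV || b == SK_LVT))
    return false;
  if ((a == SK_LV || a == SK_V) && (b == SK_V || b == SK_T))
    return false;
  if ((a == SK_LVT || a == SK_T) && b == SK_T)
    return false;
  /* GB9, GB9a: do not break before Extend, ZWJ or SpacingMark. */
  if (b == SK_EXTEND || b == SK_ZWJ || b == SK_SPACINGMARK)
    return false;
  /* GB9b: do not break after Prepend. */
  if (a == SK_PREPEND)
    return false;
  /* GB999: otherwise, break everywhere. */
  return true;
}
```

In this sketch, sk_is_break (SK_CR, SK_LF) is false while sk_is_break (SK_LF, SK_CR) is true, which shows that the decision depends on the order of the two arguments.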

Note that this function looks only at the two characters it is given, so it does not handle cases in which three or more consecutive characters are needed to determine the boundary. This is the case in particular with syllables in Indic scripts and with emoji sequences. Use uc_grapheme_breaks for such cases.
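Regional indicator pairs (flag emoji) are one case where two characters are not enough. The sketch below illustrates rules GB12 and GB13 of UAX #29 under stated assumptions: the helper names is_regional_indicator and ri_break_before are hypothetical and are not part of <unigbrk.h>, and the code is not meant to describe how the library handles this case internally.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Regional Indicator Symbols occupy U+1F1E6..U+1F1FF. */
static bool is_regional_indicator(uint32_t c)
{
  return c >= 0x1F1E6 && c <= 0x1F1FF;
}

/* Hypothetical helper. Precondition: 0 < i, and both s[i-1] and s[i] are
   regional indicators. Returns true if there is a break before s[i].
   Regional indicators are grouped in pairs from the start of their run,
   so the answer depends on how many of them precede position i. */
bool ri_break_before(const uint32_t *s, size_t i)
{
  size_t run = 0;

  /* Count the consecutive regional indicators that end at s[i-1]. */
  while (run < i && is_regional_indicator(s[i - 1 - run]))
    run++;

  /* An odd count means s[i-1] still lacks its partner, so s[i] joins it.
     An even count means all preceding pairs are complete, so break. */
  return run % 2 == 0;
}
```

For the four-character sequence U+1F1E9 U+1F1EA U+1F1EB U+1F1F7 (the flags "DE" and "FR"), every adjacent pair consists of two regional indicators, so a test that sees only two characters cannot tell the pairs apart. Counting the run shows that only the position before U+1F1EB is a break.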

