The following functions find a single boundary between grapheme clusters in a string.
const uint8_t * u8_grapheme_next (const uint8_t *s, const uint8_t *end)
const uint16_t * u16_grapheme_next (const uint16_t *s, const uint16_t *end)
const uint32_t * u32_grapheme_next (const uint32_t *s, const uint32_t *end)
Returns the start of the next grapheme cluster following s,
or end if no grapheme cluster break is encountered before it.
Returns NULL if and only if s == end.
Note that these functions do not handle the case when a character
outside of the range between s and end is needed to
determine the boundary.
This is the case in particular with syllables in Indic scripts or emojis.
Use the _grapheme_breaks functions for such cases.
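The return-value contract above (NULL only when s == end, otherwise a pointer to the next boundary or to end) is enough to walk a string cluster by cluster. The following is a minimal sketch of that loop, not code from the library. To keep it compilable without libunistring, the stepping function is passed as a pointer with the same shape as u8_grapheme_next; the names grapheme_next_fn, next_byte_stub and count_clusters are hypothetical, and the stub simply treats every byte as its own cluster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as u8_grapheme_next from <unigbrk.h>. */
typedef const uint8_t *(*grapheme_next_fn) (const uint8_t *s,
                                            const uint8_t *end);

/* Hypothetical stand-in so the sketch builds without libunistring:
   every byte counts as one cluster.  It follows the documented
   contract: NULL if and only if s == end.  */
const uint8_t *
next_byte_stub (const uint8_t *s, const uint8_t *end)
{
  return s == end ? NULL : s + 1;
}

/* Counts the clusters in [s, end) by stepping from one boundary to
   the next until end is reached.  */
size_t
count_clusters (const uint8_t *s, const uint8_t *end, grapheme_next_fn next)
{
  size_t count = 0;

  assert (s != NULL && end != NULL && next != NULL);
  while (s != end)
    {
      s = next (s, end);  /* not NULL here, because s != end */
      count++;
    }
  return count;
}
```

In a program linked against libunistring, one would presumably pass u8_grapheme_next itself in place of the stub. The limitation described above still applies: boundaries that depend on characters outside the range between s and end are not handled by this kind of loop.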
const uint8_t * u8_grapheme_prev (const uint8_t *s, const uint8_t *start)
const uint16_t * u16_grapheme_prev (const uint16_t *s, const uint16_t *start)
const uint32_t * u32_grapheme_prev (const uint32_t *s, const uint32_t *start)
Returns the start of the grapheme cluster preceding s, or
start if no grapheme cluster break is encountered before it.
Returns NULL if and only if s == start.
Note that these functions do not handle the case when a character
outside of the range between start and s is needed to
determine the boundary.
This is the case in particular with syllables in Indic scripts or emojis.
Use the _grapheme_breaks functions for such cases.
Note also that these functions work only on well-formed Unicode strings.
The following functions determine all of the grapheme cluster boundaries in a string.
void u8_grapheme_breaks (const uint8_t *s, size_t n, char *p)
void u16_grapheme_breaks (const uint16_t *s, size_t n, char *p)
void u32_grapheme_breaks (const uint32_t *s, size_t n, char *p)
void ulc_grapheme_breaks (const char *s, size_t n, char *p)
void uc_grapheme_breaks (const ucs4_t *s, size_t n, char *p)
Determines the grapheme cluster break points in s, an array of
n units, and stores the result at p[0..n-1].
p[i] = 1 means that there is a grapheme cluster boundary between
s[i-1] and s[i].
p[i] = 0 means that s[i-1] and s[i] are part of the
same grapheme cluster.
p[0] is always set to 1, because there is always a
grapheme cluster break at the start of the text.
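Once the array p has been filled in, it can be read without further calls into the library. The sketch below shows two ways to consume it, relying only on the meaning of p[i] described above. The helper names count_breaks and cluster_length are hypothetical and are not part of <unigbrk.h>.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the grapheme clusters described by p[0..n-1]: every entry
   equal to 1 marks the first unit of a cluster.  */
size_t
count_breaks (const char *p, size_t n)
{
  size_t count = 0;

  for (size_t i = 0; i < n; i++)
    if (p[i] == 1)
      count++;
  return count;
}

/* Returns the length, in units, of the cluster that starts at index i.
   The caller must pass an index at which p[i] == 1.  */
size_t
cluster_length (const char *p, size_t n, size_t i)
{
  size_t j = i + 1;

  assert (i < n && p[i] == 1);
  /* Units with p[j] == 0 belong to the same cluster as unit j-1.  */
  while (j < n && p[j] == 0)
    j++;
  return j - i;
}
```

For a hand-written array { 1, 0, 0, 1 }, count_breaks returns 2, and cluster_length returns 3 for index 0 and 1 for index 3. Lengths are measured in units of the string (bytes for the u8_ variant), not in characters.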
In addition to the above variants for UTF-8, UTF-16, and UTF-32 strings,
<unigbrk.h> provides another variant: uc_grapheme_breaks.
This is similar to u32_grapheme_breaks, but it also accepts
characters that may not be representable in UTF-32, such as control
characters.