Gnash  0.8.10
gnash::utf8 Namespace Reference

Utilities to convert between std::string and std::wstring.

Enumerations

enum  TextEncoding {
  encUNSPECIFIED, encUTF8, encUTF16BE, encUTF16LE,
  encUTF32BE, encUTF32LE, encSCSU, encUTF7,
  encUTFEBCDIC, encBOCU1
}
enum  EncodingGuess { ENCGUESS_UNICODE = 0, ENCGUESS_JIS = 1, ENCGUESS_OTHER = 2 }

Functions

std::wstring decodeCanonicalString (const std::string &str, int version)
 Converts a std::string with multibyte characters into a std::wstring.
std::string encodeCanonicalString (const std::wstring &wstr, int version)
 Converts a std::wstring into canonical std::string.
std::string encodeLatin1Character (boost::uint32_t ucsCharacter)
 Encodes the given wide character as a single character of at least 8 bits.
boost::uint32_t decodeNextUnicodeCharacter (std::string::const_iterator &it, const std::string::const_iterator &e)
 Return the next Unicode character in the UTF-8 encoded string.
std::string encodeUnicodeCharacter (boost::uint32_t ucs_character)
 Encodes the given wide character into a canonical string, theoretically up to 6 chars in length.
char * stripBOM (char *in, size_t &size, TextEncoding &encoding)
 Interpret (and skip) Byte Order Mark in input stream.
const char * textEncodingName (TextEncoding enc)
 Return name of a text encoding.
EncodingGuess guessEncoding (const std::string &s, int &length, std::vector< int > &offsets)
 Common code for guessing at the encoding of arbitrary text, choosing between Unicode, JIS, and other (see EncodingGuess).

Detailed Description

Utilities to convert between std::string and std::wstring.

Strings in Gnash are generally stored as std::strings. However, we also have to deal with characters outside the 7-bit ASCII range (codes 128 and above), which can be encoded in two different ways.

SWF6 and later use UTF-8, a multibyte encoding that allows many thousands of unique codes. Multibyte characters are difficult to handle because the character count of a string, which many string operations rely on, cannot be determined without parsing the string. Converting the string to a std::wstring facilitates string operations, as the length of the wstring equals the number of valid characters. (A wchar_t is generally 32 bits wide, and two bytes is its minimum size; the proprietary player seems to handle only characters up to 65535.)
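
The gap between byte length and character count can be seen in a minimal sketch that is independent of Gnash's own code. The helper name countCodePoints is hypothetical, and the function assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of gnash::utf8: counts the code points in a
// well-formed UTF-8 string by skipping continuation bytes (10xxxxxx).
std::size_t countCodePoints(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;  // ASCII byte or lead byte of a sequence
    }
    return n;
}
```

For the UTF-8 string "na\xC3\xAF" "ve" (the word "naïve"), std::string::size() reports 6 bytes while countCodePoints() reports 5 characters.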

SWF5 and earlier, however, used the ISO-8859 family of encodings, which allows the standard 128 ASCII characters plus 128 extra characters that depend on the particular part of ISO-8859 in use. Characters are 8 bits wide, not the 7 bits of standard ASCII. SWF5 cannot handle multibyte characters without special functions.

It is important that SWF5 can distinguish between the two encodings, so we cannot convert all strings to UTF-8. Please note that, although this namespace is called utf8, what the Adobe player uses is only loosely related to real Unicode, so the encoding support here is correspondingly non-standard.


Enumeration Type Documentation

enum gnash::utf8::EncodingGuess

Enumerator:
ENCGUESS_UNICODE 
ENCGUESS_JIS 
ENCGUESS_OTHER 

enum gnash::utf8::TextEncoding

Enumerator:
encUNSPECIFIED 
encUTF8 
encUTF16BE 
encUTF16LE 
encUTF32BE 
encUTF32LE 
encSCSU 
encUTF7 
encUTFEBCDIC 
encBOCU1 

Function Documentation

DSOEXPORT std::wstring gnash::utf8::decodeCanonicalString ( const std::string &  str,
int  version 
)

Converts a std::string with multibyte characters into a std::wstring.

Returns:
a version-dependent wstring.
Parameters:
str  the canonical string to convert.
version  the SWF version, used to decide how to decode the string. For SWF5, UTF-8 (or any other) multibyte encoded characters are converted char by char, mangling the string.
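
The char-by-char conversion described for SWF5 can be illustrated with a short sketch. The function name decodeBytewise is hypothetical and the code is not taken from Gnash; it only shows how a multibyte sequence ends up as several wide characters.

```cpp
#include <string>

// Hypothetical illustration of the SWF5 path: every byte becomes one wide
// character, so a multibyte UTF-8 sequence is split into several characters.
std::wstring decodeBytewise(const std::string& str)
{
    std::wstring out;
    out.reserve(str.size());
    for (unsigned char c : str) {
        out.push_back(static_cast<wchar_t>(c));
    }
    return out;
}
```

Given the two-byte UTF-8 sequence "\xC3\xA9" (U+00E9), this yields a wstring of length 2 holding 0xC3 and 0xA9 rather than a single character.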

References gnash::key::e, and decodeNextUnicodeCharacter().

Referenced by gnash::TextField::TextField(), gnash::TextField::replaceSelection(), and gnash::TextField::updateText().

DSOEXPORT boost::uint32_t gnash::utf8::decodeNextUnicodeCharacter ( std::string::const_iterator &  it,
const std::string::const_iterator &  e 
)

Return the next Unicode character in the UTF-8 encoded string.

Invalid UTF-8 sequences produce a U+FFFD character as output. Advances string iterator past the character returned, unless the returned character is '\0', in which case the iterator does not advance.
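
The documented contract (U+FFFD for invalid input, no advance when 0 is returned) can be sketched as follows. This is a simplified stand-in under stated assumptions, not Gnash's implementation: the name decodeNext is hypothetical, only sequences of up to four bytes are handled, and surrogates and values above U+10FFFF are not rejected.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for decodeNextUnicodeCharacter(); the name and the
// exact error handling are assumptions, not Gnash's code.
std::uint32_t decodeNext(std::string::const_iterator& it,
                         const std::string::const_iterator& end)
{
    const std::uint32_t invalid = 0xFFFD;

    // End of input or a NUL byte: return 0 and leave the iterator alone.
    if (it == end || *it == '\0') return 0;

    const unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;  // single-byte (ASCII) character

    int extra = 0;              // number of continuation bytes expected
    std::uint32_t cp = 0;       // code point being assembled
    std::uint32_t minimum = 0;  // smallest value that needs this length
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; minimum = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minimum = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minimum = 0x10000; }
    else return invalid;  // stray continuation byte or unsupported lead byte

    for (int i = 0; i < extra; ++i) {
        if (it == end) return invalid;           // truncated sequence
        const unsigned char c = static_cast<unsigned char>(*it);
        if ((c & 0xC0) != 0x80) return invalid;  // not a continuation byte
        ++it;
        cp = (cp << 6) | (c & 0x3F);
    }

    if (cp < minimum) return invalid;  // overlong encoding
    return cp;
}
```

A caller would typically loop until the function returns 0, which signals either the end of the range or a NUL byte at the current position.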

References FIRST_BYTE, and NEXT_BYTE.

Referenced by decodeCanonicalString(), and guessEncoding().

DSOEXPORT std::string gnash::utf8::encodeCanonicalString ( const std::wstring &  wstr,
int  version 
)

Converts a std::wstring into canonical std::string.

Returns:
a version-dependent encoded std::string.
Parameters:
wstr  the wide string to convert.
version  the SWF version, used to decide how to encode the string.

For SWF 5, each character is stored as an 8-bit (at least) char, rather than converting it to a canonical UTF-8 byte sequence. Gnash can then distinguish between 8-bit characters, which it handles correctly, and multi-byte characters, which are regarded as multiple characters for string methods.

References encodeUnicodeCharacter(), and encodeLatin1Character().

Referenced by gnash::TextField::setTextValue(), gnash::TextField::get_text_value(), and gnash::TextField::get_htmltext_value().

DSOEXPORT std::string gnash::utf8::encodeLatin1Character ( boost::uint32_t  ucsCharacter)

Encodes the given wide character as a single character of at least 8 bits.

Allows storage of Latin1 (ISO-8859-1) characters. This is the format of SWF5 and below.
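
A minimal sketch of storing a code point as one 8-bit character might look like this. The name toLatin1Char is hypothetical, and truncating values above 0xFF to their low 8 bits is an assumption made for illustration, not documented Gnash behaviour.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch: store a code point as one 8-bit char. Values above
// 0xFF do not fit in Latin-1; keeping only the low 8 bits is an assumption
// made here for illustration.
std::string toLatin1Char(std::uint32_t ucs)
{
    const unsigned char byte = static_cast<unsigned char>(ucs);
    return std::string(1, static_cast<char>(byte));
}
```

With this sketch, U+00E9 is stored as the single byte 0xE9, where UTF-8 would need the two bytes 0xC3 0xA9.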

Referenced by encodeCanonicalString().

DSOEXPORT std::string gnash::utf8::encodeUnicodeCharacter ( boost::uint32_t  ucs_character)

Encodes the given wide character into a canonical string, theoretically up to 6 chars in length.
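
The "up to 6 chars" figure matches the original UTF-8 layout, which covers values up to 0x7FFFFFFF in one to six bytes. The following sketch shows that layout; the name encodeUtf8 is hypothetical, the code is not Gnash's, and it assumes the argument does not exceed 0x7FFFFFFF.

```cpp
#include <cstdint>
#include <string>

// Hypothetical encoder following the original 1-to-6 byte UTF-8 layout.
// Assumes cp <= 0x7FFFFFFF; larger values are not representable.
std::string encodeUtf8(std::uint32_t cp)
{
    if (cp < 0x80) return std::string(1, static_cast<char>(cp));

    // Pick the number of continuation bytes and the matching lead-byte marker.
    int extra = 0;
    unsigned char marker = 0;
    if (cp < 0x800)          { extra = 1; marker = 0xC0; }
    else if (cp < 0x10000)   { extra = 2; marker = 0xE0; }
    else if (cp < 0x200000)  { extra = 3; marker = 0xF0; }
    else if (cp < 0x4000000) { extra = 4; marker = 0xF8; }
    else                     { extra = 5; marker = 0xFC; }

    std::string out;
    // Lead byte: marker bits plus the highest payload bits.
    out.push_back(static_cast<char>(marker | (cp >> (6 * extra))));
    // Continuation bytes: 10xxxxxx, six payload bits each, high to low.
    for (int i = extra - 1; i >= 0; --i) {
        out.push_back(static_cast<char>(0x80 | ((cp >> (6 * i)) & 0x3F)));
    }
    return out;
}
```

Modern UTF-8 stops at four bytes (U+10FFFF), so the five- and six-byte forms are only of theoretical interest, as the description above suggests.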

Referenced by encodeCanonicalString().

DSOEXPORT EncodingGuess gnash::utf8::guessEncoding ( const std::string &  s,
int &  length,
std::vector< int > &  offsets 
)

Common code for guessing at the encoding of arbitrary text, choosing between Unicode, JIS, and other (see EncodingGuess).

TODO: It's doubtful if this even works, and it may not be useful at all.

References width, gnash::key::e, length, gnash::key::c, decodeNextUnicodeCharacter(), ENCGUESS_UNICODE, ENCGUESS_JIS, and ENCGUESS_OTHER.

DSOEXPORT char * gnash::utf8::stripBOM ( char *  in,
size_t &  size,
TextEncoding &  encoding 
)

Interpret (and skip) Byte Order Mark in input stream.

This function takes a pointer to a buffer and returns a pointer to the start of the actual data, after the BOM if one is present. No conversion is performed and no bytes are copied; the function only skips the BOM and reports its interpretation through the encoding output parameter.

See http://en.wikipedia.org/wiki/Byte-order_mark

Parameters:
in  The input buffer.
size  Size of the input buffer, will be decremented by the size of the BOM, if any.
encoding  Output parameter, will always be set. encUNSPECIFIED if no BOM is found.
Returns:
A pointer either equal to 'in' or some bytes inside it.
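
The behaviour described above can be sketched with a small lookup table. This is an illustrative re-implementation under assumptions, not Gnash's code: the names skipBom and Bom are hypothetical, the local enum stands in for the BOM-related TextEncoding values, and the order in which the marks are tested is a choice made for this sketch.

```cpp
#include <cstddef>
#include <cstring>

// Local stand-in for the BOM-related TextEncoding values.
enum class Bom { Unspecified, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Hypothetical sketch of BOM skipping: no bytes are copied or converted.
const char* skipBom(const char* in, std::size_t& size, Bom& encoding)
{
    struct Entry { const char* bytes; std::size_t len; Bom enc; };
    // UTF-32LE (FF FE 00 00) is tested before UTF-16LE (FF FE) because the
    // shorter mark is a prefix of the longer one.
    static const Entry table[] = {
        { "\x00\x00\xFE\xFF", 4, Bom::Utf32BE },
        { "\xFF\xFE\x00\x00", 4, Bom::Utf32LE },
        { "\xEF\xBB\xBF",     3, Bom::Utf8    },
        { "\xFE\xFF",         2, Bom::Utf16BE },
        { "\xFF\xFE",         2, Bom::Utf16LE },
    };
    for (const Entry& e : table) {
        if (size >= e.len && std::memcmp(in, e.bytes, e.len) == 0) {
            size -= e.len;      // report the remaining payload size
            encoding = e.enc;
            return in + e.len;  // first byte after the BOM
        }
    }
    encoding = Bom::Unspecified;  // always set, even without a BOM
    return in;
}
```

Note that FF FE 00 00 is inherently ambiguous: it could also be a UTF-16LE BOM followed by a U+0000 character, and this sketch resolves it in favour of UTF-32LE.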

References encUNSPECIFIED, encUTF16LE, encUTF16BE, encUTF8, encUTF32BE, and encUTF32LE.

Referenced by gnash::movie_root::LoadCallback::processLoad().

DSOEXPORT const char * gnash::utf8::textEncodingName ( TextEncoding  enc)