
2.2 URI Encoding

The url-generic-parse-url parser does not obey RFC 3986 in one respect: it allows non-ASCII characters in URI strings.

Strictly speaking, RFC 3986 compatible URIs may only consist of ASCII characters; non-ASCII characters are represented by converting them to UTF-8 byte sequences, and performing percent encoding on the bytes. For example, the o-umlaut character (U+00F6) is converted to the UTF-8 byte sequence ‘\xC3\xB6’, then percent encoded to ‘%C3%B6’. (Certain “reserved” ASCII characters must also be percent encoded when they appear in URI components.)
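
The two steps described above (UTF-8 conversion, then percent encoding of each byte) can be sketched in a few lines of Emacs Lisp. The helper my-percent-encode-utf8 is a hypothetical name used only for illustration; it is not part of the url library, and unlike the library functions it encodes every byte, including ASCII ones.

```elisp
;; Hypothetical helper for illustration; not part of the url library.
(defun my-percent-encode-utf8 (string)
  "Return STRING with every one of its UTF-8 bytes percent-encoded."
  (mapconcat (lambda (byte) (format "%%%02X" byte))
             ;; Step 1: convert the characters to a UTF-8 byte string.
             (encode-coding-string string 'utf-8)
             ;; Step 2 happens in the lambda: `%' plus two hex digits.
             ""))

(my-percent-encode-utf8 "\u00f6")   ; o-umlaut, U+00F6
;; ⇒ "%C3%B6"
```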

The function url-encode-url can be used to convert a URI string containing arbitrary characters to one that is properly percent-encoded in accordance with RFC 3986.

— Function: url-encode-url url-string

This function returns a properly URI-encoded version of url-string. It also performs URI normalization, e.g., converting the scheme component to lowercase if it was previously uppercase.
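
As a minimal sketch of how this might be called, the example below passes a URI with an uppercase scheme, a non-ASCII character, and a space in the path. The host and path are made up for illustration. The result should be an all-ASCII string with a lowercase scheme; the exact output is not shown here because it can depend on the Emacs version.

```elisp
(require 'url-util)

;; Illustrative input: uppercase scheme, e-acute (U+00E9) and a
;; space in the path.  example.com is a placeholder host.
(url-encode-url "HTTP://example.com/caf\u00e9 menu")
```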

To convert between a string containing arbitrary characters and a percent-encoded all-ASCII string, use the functions url-hexify-string and url-unhex-string:

— Function: url-hexify-string string &optional allowed-chars

This function performs percent-encoding on string, and returns the result.

If string is multibyte, it is first converted to a UTF-8 byte string. Each byte corresponding to an allowed character is left as-is, while all other bytes are converted to a three-character sequence: ‘%’ followed by two upper-case hex digits.

The allowed characters are specified by allowed-chars. If this argument is nil, the allowed characters are those specified as unreserved characters by RFC 3986 (see the variable url-unreserved-chars). Otherwise, allowed-chars should be a vector whose n-th element is non-nil if character n is allowed.
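
The following sketch contrasts the default behavior with a custom allowed-chars argument. The variable my-path-segment-chars is a hypothetical name, and sizing the vector at 256 elements (one slot per byte value) is an assumption based on the byte-wise description above.

```elisp
(require 'url-util)

;; Default: only the RFC 3986 unreserved characters are left as-is,
;; so both `/' and the space are encoded.
(url-hexify-string "a/b c")
;; ⇒ "a%2Fb%20c"

;; Hypothetical custom table: the unreserved characters plus `/'.
;; Assumes one slot per possible byte value (0 through 255).
(defvar my-path-segment-chars
  (let ((allowed (make-vector 256 nil)))
    (dolist (char url-unreserved-chars)
      (aset allowed char t))
    (aset allowed ?/ t)
    allowed)
  "Vector whose Nth element is non-nil if byte N is left unencoded.")

(url-hexify-string "a/b c" my-path-segment-chars)
;; ⇒ "a/b%20c"
```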

— Function: url-unhex-string string &optional allow-newlines

This function replaces percent-encoding sequences in string with their character equivalents, and returns the resulting string.

If allow-newlines is non-nil, it allows the decoding of carriage returns and line feeds, which are normally forbidden in URIs.
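
A short sketch of decoding follows. One detail worth checking in your own Emacs version: for non-ASCII input, url-unhex-string appears to return the raw decoded bytes rather than characters, so the example passes the result through decode-coding-string, on the assumption that the bytes are UTF-8.

```elisp
(require 'url-util)

;; ASCII-only input decodes directly.
(url-unhex-string "hello%20world")
;; ⇒ "hello world"

;; For non-ASCII data, decode the resulting bytes as UTF-8 to get
;; the original character back (here o-umlaut, U+00F6).
(decode-coding-string (url-unhex-string "%C3%B6") 'utf-8)
;; ⇒ "ö"
```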