6.12.3 Encoding

Textual input and output on Guile ports is layered on top of binary operations. To this end, each port has an associated character encoding that controls how bytes read from the port are converted to characters, and how characters written to the port are converted to bytes.

Scheme Procedure: port-encoding port
C Function: scm_port_encoding (port)

Returns, as a string, the character encoding that port uses to interpret its input and output.

Scheme Procedure: set-port-encoding! port enc
C Function: scm_set_port_encoding_x (port, enc)

Sets the character encoding that will be used to interpret I/O to port. enc is a string containing the name of an encoding. Valid encoding names are those defined by IANA, for example "UTF-8" or "ISO-8859-1".
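As a minimal sketch of how a port's encoding changes what is read, the following decodes the same two bytes under two different encodings. The helper name decode-first-char is hypothetical, and the example assumes the (rnrs bytevectors) and (ice-9 binary-ports) modules are available for building an in-memory binary port.

```scheme
(use-modules (rnrs bytevectors)     ; u8-list->bytevector
             (ice-9 binary-ports))  ; open-bytevector-input-port

;; Hypothetical helper: wrap BYTES (a list of integers) in an input
;; port, set the port's encoding, and read one character from it.
(define (decode-first-char bytes encoding)
  (let ((port (open-bytevector-input-port (u8-list->bytevector bytes))))
    (set-port-encoding! port encoding)
    (read-char port)))

;; The bytes C3 A9 are the UTF-8 encoding of U+00E9.
(decode-first-char '(#xC3 #xA9) "UTF-8")       ; => the character U+00E9
;; Under ISO-8859-1 every byte is one character, so only C3 is consumed.
(decode-first-char '(#xC3 #xA9) "ISO-8859-1")  ; => the character U+00C3
```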

When ports are created, they are assigned an encoding. Usually, the initial encoding of a port is the value of the %default-port-encoding fluid at the time the port is created.

Scheme Variable: %default-port-encoding

A fluid containing the name of the encoding to be used by default for newly created ports (see Fluids and Dynamic States). As a special case, the value #f is equivalent to "ISO-8859-1".

The %default-port-encoding fluid itself defaults to the encoding appropriate for the current locale, if setlocale has been called. See Locales, for more on locales and when you might need to call setlocale explicitly.
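The following sketch illustrates how the fluid could be rebound around port creation with with-fluids. The helper name encoding-of-new-file-port is hypothetical, and the use of "/dev/null" assumes a POSIX-like system; any readable file would serve.

```scheme
;; Hypothetical helper: open a file port while ENCODING is the default
;; port encoding, and report the encoding the new port was assigned.
(define (encoding-of-new-file-port encoding)
  (with-fluids ((%default-port-encoding encoding))
    (let* ((port (open-input-file "/dev/null"))  ; assumed to exist
           (result (port-encoding port)))
      (close-port port)
      result)))

(encoding-of-new-file-port "UTF-8")       ; => "UTF-8"
(encoding-of-new-file-port "ISO-8859-1")  ; => "ISO-8859-1"
```

Because with-fluids restores the previous value on exit, ports created outside the dynamic extent are unaffected.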

Some port types have other ways of determining their initial encodings. String ports, for example, default to the UTF-8 encoding, in order to be able to represent all characters regardless of the current locale. File ports can optionally sniff their file for a coding: declaration; see File Ports. Binary ports might be initialized to the ISO-8859-1 encoding, in which each codepoint between 0 and 255 corresponds to a byte with that value.
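As a small illustration of the string port case, the sketch below creates a string port while a different default encoding is in effect. The helper name string-port-encoding-under is hypothetical, and the result assumes a Guile version that behaves as described in this section.

```scheme
;; Hypothetical helper: create a string port while DEFAULT is the
;; default port encoding, and return the encoding the port reports.
(define (string-port-encoding-under default)
  (with-fluids ((%default-port-encoding default))
    (port-encoding (open-input-string "abc"))))

;; The string port ignores the default and uses UTF-8.
(string-port-encoding-under "ISO-8859-1")  ; => "UTF-8"
```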

Currently, ports work only with non-modal encodings. Most encodings are non-modal, meaning that the conversion of bytes to a string doesn’t depend on context: the same byte sequence always decodes to the same string. A couple of modal encodings are in common use, like ISO-2022-JP and ISO-2022-KR, and they are not yet supported.

Each port also has an associated conversion strategy, which determines what to do when a Guile character can’t be converted to the port’s encoded character representation for output. There are three possible strategies: to raise an error, to replace the character with a hex escape, or to replace the character with a substitute character. Port conversion strategies are also used when decoding characters from an input port.

Scheme Procedure: port-conversion-strategy port
C Function: scm_port_conversion_strategy (port)

Returns the behavior of the port when outputting a character that is not representable in the port’s current encoding.

If port is #f, then the current default behavior will be returned. New ports will have this default behavior when they are created.

Scheme Procedure: set-port-conversion-strategy! port sym
C Function: scm_set_port_conversion_strategy_x (port, sym)

Sets the behavior of Guile when outputting a character that is not representable in the port’s current encoding, or when Guile encounters a decoding error when trying to read a character. sym can be either error, substitute, or escape.

If port is an open port, the conversion error behavior is set for that port. If it is #f, it is set as the default behavior for any future ports that get created in this thread.
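A brief sketch of setting and querying the strategy of a single port follows. The variable name demo-port is hypothetical.

```scheme
;; Any open port will do; a string output port needs no external resources.
(define demo-port (open-output-string))

;; Change the strategy for this port only.
(set-port-conversion-strategy! demo-port 'escape)
(port-conversion-strategy demo-port)  ; => escape

;; Passing #f queries the current default instead of a specific port.
(port-conversion-strategy #f)
```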

As with port encodings, there is a fluid which determines the initial conversion strategy for a port.

Scheme Variable: %default-port-conversion-strategy

The fluid that defines the conversion strategy for newly created ports, and also for other conversion routines such as scm_to_stringn, scm_from_stringn, string->pointer, and pointer->string.

Its value must be one of the symbols described above, with the same semantics: error, substitute, or escape.

When Guile starts, its value is substitute.

Note that (set-port-conversion-strategy! #f sym) is equivalent to (fluid-set! %default-port-conversion-strategy sym).
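To change the default only temporarily, the fluid can be rebound with with-fluids rather than set. The sketch below shows that both the reported default and a port created inside the dynamic extent pick up the rebound value. The helper name strategies-under is hypothetical.

```scheme
;; Hypothetical helper: with DEFAULT as the default conversion strategy,
;; return the reported default and the strategy of a newly created port.
(define (strategies-under default)
  (with-fluids ((%default-port-conversion-strategy default))
    (list (port-conversion-strategy #f)
          (port-conversion-strategy (open-output-string)))))

(strategies-under 'error)  ; => (error error)
```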

As mentioned above, for an output port there are three possible port conversion strategies. The error strategy will throw an error when a nonconvertible character is encountered. The substitute strategy will replace nonconvertible characters with a question mark (‘?’). Finally, the escape strategy will print nonconvertible characters as a hex escape, using the escaping that is recognized by Guile’s string syntax. Note that if the port’s encoding is a Unicode encoding, like UTF-8, then every character is representable and encoding errors are impossible.
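The three output strategies can be compared with a sketch like the following, which writes a string containing U+03BB (not representable in ISO-8859-1) to an in-memory binary port. The helper name encode-with-strategy is hypothetical, and the (rnrs bytevectors) and (ice-9 binary-ports) modules are assumed to be available. The exact form of the escape sequence depends on Guile's string-syntax settings, so it is not shown.

```scheme
(use-modules (rnrs bytevectors)     ; utf8->string
             (ice-9 binary-ports))  ; open-bytevector-output-port

;; Hypothetical helper: write STR to an ISO-8859-1 port using STRATEGY
;; and return the bytes that were produced, viewed as a string.
(define (encode-with-strategy str strategy)
  (call-with-values open-bytevector-output-port
    (lambda (port get-bytes)
      (set-port-encoding! port "ISO-8859-1")
      (set-port-conversion-strategy! port strategy)
      (display str port)
      (force-output port)
      (utf8->string (get-bytes)))))

;; "a", U+03BB, "z": the middle character has no ISO-8859-1 encoding.
(define sample (string #\a (integer->char #x3BB) #\z))

(encode-with-strategy sample 'substitute)  ; => "a?z"
(encode-with-strategy sample 'escape)      ; "a", a backslash escape, "z"

;; With the error strategy the write throws; here the exception key is
;; returned instead of letting the error propagate.
(catch #t
  (lambda () (encode-with-strategy sample 'error))
  (lambda (key . args) key))
```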

For an input port, the error strategy will cause Guile to throw an error if it encounters an invalid encoding, such as might happen if you tried to read ISO-8859-1 as UTF-8. The error is thrown before advancing the read position. The substitute strategy will replace the bad bytes with a U+FFFD replacement character, in accordance with Unicode recommendations. When reading from an input port, the escape strategy is treated as if it were error.
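The input side can be sketched in the same way, by reading a byte sequence that is not valid UTF-8. The helper name decode-two-chars is hypothetical, and the (rnrs bytevectors) and (ice-9 binary-ports) modules are assumed to be available.

```scheme
(use-modules (rnrs bytevectors)     ; u8-list->bytevector
             (ice-9 binary-ports))  ; open-bytevector-input-port

;; Hypothetical helper: read two characters from BYTES, decoded as
;; UTF-8 under the given conversion STRATEGY.
(define (decode-two-chars bytes strategy)
  (let ((port (open-bytevector-input-port (u8-list->bytevector bytes))))
    (set-port-encoding! port "UTF-8")
    (set-port-conversion-strategy! port strategy)
    (let* ((c1 (read-char port))
           (c2 (read-char port)))
      (list c1 c2))))

;; #xFF can never begin a UTF-8 sequence; #x41 is the letter A.
(decode-two-chars '(#xFF #x41) 'substitute)  ; => (U+FFFD #\A)

;; With the error strategy the first read throws; here the exception
;; key is returned instead of letting the error propagate.
(catch #t
  (lambda () (decode-two-chars '(#xFF #x41) 'error))
  (lambda (key . args) key))
```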