This manual was last updated 25 August 2008 for version 1.10 of GNU Libidn.
Copyright © 2002, 2003, 2004, 2005, 2006, 2007, 2008 Simon Josefsson.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License”.
GNU Libidn is a fully documented implementation of the Stringprep, Punycode and IDNA specifications defined by the IETF Internationalized Domain Names (IDN) working group, used for internationalized domain names. The native C, C# and Java libraries are available under the GNU Lesser General Public License version 2.1 (see GNU LGPL).
The library contains a generic Stringprep implementation that does Unicode 3.2 NFKC normalization, mapping and prohibition of characters, and bidirectional character handling. Profiles for Nameprep, iSCSI, SASL and XMPP are included. Punycode and ASCII Compatible Encoding (ACE) via IDNA are supported. A mechanism to define Top-Level Domain (TLD) specific validation tables, and to compare strings against those tables, is included. Default tables for some TLDs are also included.
The Stringprep API consists of two main functions: one for converting data from the system's native representation into UTF-8, and one to perform the Stringprep processing. Adding a new Stringprep profile for your application within the API is straightforward. The Punycode API consists of one encoding function and one decoding function. The IDNA API consists of the ToASCII and ToUnicode functions, as well as a high-level interface for converting entire domain names to and from the ACE encoded form. The TLD API consists of one set of functions to extract the TLD name from a domain string, one set of functions to locate the proper TLD table to use based on the TLD name, core functions to validate a string against a TLD table, and some utility wrappers to perform all the steps in one call.
The library is used by, e.g., GNU SASL and Shishi to process user names and passwords. Libidn can be built into GNU Libc to enable a new system-wide getaddrinfo flag for IDN processing.
Libidn is developed for the GNU/Linux system, but runs on over 20 Unix platforms (including Solaris, IRIX, AIX, and Tru64) and Windows. Libidn is written in C and (parts of) the API are accessible from C, C#, C++, Emacs Lisp, Python and Java.
Also included are a command line tool, several self tests, code examples, and more, all licensed under the GNU General Public License version 3.0 (see GNU GPL).
This manual documents the library programming interface. All functions and data types provided by the library are explained. Also included are examples and documentation for the command line tool idn, which provides a quick interface to the library. The Emacs Lisp bindings for the library are also discussed.
The reader is assumed to possess basic familiarity with internationalization concepts and network programming in C or C++.
This manual can be used in several ways. If read from beginning to end, it gives a good introduction to the library and how it can be used in an application. Forward references are included where necessary. Later on, the manual can be used as a reference manual to get just the information needed about any particular interface of the library. Experienced programmers might want to start by looking at the examples at the end of the manual (see Examples), and then only read up on those parts of the interface which are unclear.
This library might have a couple of advantages over other libraries doing a similar job.
The following illustration shows the components that make up Libidn, and how your application relates to the library. In the illustration, various components are shown as boxes. You see the generic StringPrep component, the various StringPrep profiles including Nameprep, the Punycode component, the IDNA component, and the TLD component. The arrows indicate aggregation, e.g., IDNA uses Punycode and Nameprep, and in turn Nameprep uses the generic StringPrep interface. The interfaces to all components are available to applications; no component within the library is hidden from the application.

Libidn has at some point in time been tested on the following platforms. Online build reports for each platform and Libidn version are available at http://autobuild.josefsson.org/libidn/.
alphaev67-unknown-linux-gnu, alphaev6-unknown-linux-gnu,
arm-unknown-linux-gnu, armv4l-unknown-linux-gnu,
hppa-unknown-linux-gnu, hppa64-unknown-linux-gnu,
i686-pc-linux-gnu, ia64-unknown-linux-gnu,
m68k-unknown-linux-gnu, mips-unknown-linux-gnu,
mipsel-unknown-linux-gnu, powerpc-unknown-linux-gnu,
s390-ibm-linux-gnu, sparc-unknown-linux-gnu,
sparc64-unknown-linux-gnu.
armv4l-unknown-linux-gnu.
alphaev67-dec-osf5.1,
alphaev68-dec-osf5.1.
alphaev6-unknown-linux-gnu,
alphaev67-unknown-linux-gnu.
ia64-unknown-linux-gnu.
x86_64-unknown-linux-gnu (AMD64
Opteron “Melody”).
powerpc64-unknown-linux-gnu.
alphaev6-unknown-linux-gnu,
alphaev67-unknown-linux-gnu, ia64-unknown-linux-gnu.
i686-pc-linux-gnu.
i686-pc-linux-gnu.
i686-pc-linux-gnu.
i686-pc-linux-gnu.
mips-sgi-irix6.5.
rs6000-ibm-aix4.3.2.0.
i686-pc-cygwin.
ia64-hp-hpux11.22,
hppa2.0w-hp-hpux11.11.
sparc-sun-solaris2.7.
sparc-sun-solaris2.8.
sparc-sun-solaris2.9.
alpha-unknown-netbsd1.6,
i386-unknown-netbsdelf1.6.
alpha-unknown-openbsd3.1,
i386-unknown-openbsd3.1.
alpha-unknown-freebsd4.7,
alpha-unknown-freebsd4.8, i386-unknown-freebsd4.7,
i386-unknown-freebsd4.8.
powerpc-apple-darwin6.5.
powerpc-apple-darwin8.0.
m68k-uclinux-elf.
arm-linux.
i586-mingw32msvc.
If you use Libidn on, or port Libidn to, a new platform please report it to the author.
A mailing list where users of Libidn may help each other exists, and you can reach it by sending e-mail to help-libidn@gnu.org. Archives of the mailing list discussions, and an interface to manage subscriptions, are available through the World Wide Web at http://lists.gnu.org/mailman/listinfo/help-libidn.
Commercial support is available for users of GNU Libidn. The kind of support that can be purchased may include:
If you are interested, please write to:
Simon Josefsson Datakonsult Hagagatan 24 113 47 Stockholm Sweden E-mail: simon@josefsson.org
If your company provides support related to GNU Libidn and would like to be mentioned here, contact the author (see Bug Reports).
The package can be downloaded from several places, including:
ftp://alpha.gnu.org/pub/gnu/libidn/
The latest version is stored in a file, e.g., ‘libidn-1.10.tar.gz’ where the ‘1.10’ value is the highest version number in the directory.
The package is then extracted, configured and built like many other packages that use Autoconf. For detailed information on configuring and building it, refer to the INSTALL file that is part of the distribution archive.
Here is an example terminal session that downloads, configures, builds and installs the package. You will need a few basic tools, such as ‘sh’, ‘make’ and ‘cc’.
$ wget -q ftp://alpha.gnu.org/pub/gnu/libidn/libidn-1.10.tar.gz
$ tar xfz libidn-1.10.tar.gz
$ cd libidn-1.10/
$ ./configure
...
$ make
...
$ make install
...
After that Libidn should be properly installed and ready for use.
A few configure options may be relevant, summarized in the
table.
--enable-java
--disable-tld
--enable-csharp[=IMPL]
Build the C# port into a *.DLL file. See C# API, for
more information. Here, IMPL is pnet or mono,
indicating whether the PNET cscc compiler or the Mono
mcs compiler should be used, respectively.
For the complete list, refer to the output from configure
--help.
There are two ways to build Libidn on Windows: via MinGW or via Visual Studio C++.
With MinGW, you can build a Libidn DLL and use it from other applications. After installing MinGW (http://mingw.org/) follow the generic installation instructions (see Downloading and Installing). The DLL is installed by default.
For information on how to use the DLL in other applications, see: http://www.mingw.org/mingwfaq.shtml#faq-msvcdll.
You can build Libidn as a native Visual Studio C++ project. This allows you to build the code for other platforms that VS supports, such as Windows Mobile. You need Visual Studio 2005 or later, and a Perl interpreter such as ActiveState Perl.
First download and unpack the archive as described in the generic
installation instructions (see Downloading and Installing). Don't
run ./configure. Instead, start Visual Studio and open the
project file win32/libidn.sln inside the Libidn directory. You
should be able to build the project using VS.
Output libraries will be written into the win32/lib (or
win32/lib/debug for Debug versions) folder.
If you think you have found a bug in Libidn, please investigate it and report it.
Please make an effort to produce a self-contained report, with something definite that can be tested or debugged. Vague queries or piecemeal messages are difficult to act on and don't help the development effort.
If your bug report is good, we will do our best to help you to get a corrected version of the software; if the bug report is poor, we won't do anything about it (apart from asking you to send better bug reports).
If you think something in this manual is unclear, or downright incorrect, or if the language needs to be improved, please also send a note.
Send your bug report to:
If you want to submit a patch for inclusion, from fixing a typo you discovered up to adding support for a new feature, you should submit it as a bug report (see Bug Reports). There are some things that you can do to increase the chances for it to be included in the official package.
Unless your patch is very small (say, under 10 lines) we require that you assign the copyright of your work to the Free Software Foundation. This is to protect the freedom of the project. If you have not already signed papers, we will send you the necessary information when you submit your contribution.
For contributions that don't consist of actual programming code, the only guidelines are common sense. Use it.
For code contributions, a number of style guides will help you:
If you normally code using another coding standard, there is no problem, but you should use ‘indent’ to reformat the code (see GNU Indent) before submitting your work.
To use `Libidn', you have to perform some changes to your sources and the build system. The necessary changes are small and explained in the following sections. At the end of this chapter, it is described how the library is initialized, and how the requirements of the library are verified.
A faster way to find out how to adapt your application for use with `Libidn' may be to look at the examples at the end of this manual (see Examples).
The library contains a few independent parts, and each part exports its interfaces (data types and functions) in a header file. You must include the appropriate header files in all programs using the library, either directly or through some other header file, like this:
#include <stringprep.h>
The header files and the functions they define are categorized as follows:
The name space of the stringprep part of Libidn is stringprep*
for function names, Stringprep* for data types and
STRINGPREP_* for other symbols. In addition,
_stringprep* is reserved for internal use and should never be
used by applications.
The name space of the punycode part of Libidn is punycode_* for
function names, Punycode* for data types and PUNYCODE_*
for other symbols. In addition, _punycode* is reserved for
internal use and should never be used by applications.
The name space of the IDNA part of Libidn is idna_* for
function names, Idna* for data types and IDNA_* for
other symbols. In addition, _idna* is reserved for internal
use and should never be used by applications.
The name space of the TLD part of Libidn is tld_* for function
names, Tld_* for data types and TLD_* for other symbols.
In addition, _tld* is reserved for internal use and should
never be used by applications.
The name space of the PR29 part of Libidn is pr29_* for
function names, Pr29_* for data types and PR29_* for
other symbols. In addition, _pr29* is reserved for internal
use and should never be used by applications.
Libidn is stateless and does not need any initialization.
It is often desirable to check that the version of `Libidn' used is indeed one which fits all requirements. Even with binary compatibility, new features may have been introduced, but due to a problem with the dynamic linker an old version is actually used. So you may want to check that the version is okay right after program startup.
req_version: Required version number, or NULL.
Check that the version of the library is at minimum the requested one and return the version string; return NULL if the condition is not satisfied. If a NULL is passed to this function, no check is done, but the version string is simply returned.
See STRINGPREP_VERSION for a suitable req_version string.
Return value: Version string of run-time library, or NULL if the run-time library does not meet the required version number.
The normal way to use the function is to put something similar to the
following first in your main:
if (!stringprep_check_version (STRINGPREP_VERSION))
  {
    printf ("stringprep_check_version() failed:\n"
            "Header file incompatible with shared library.\n");
    exit (1);
  }
If you want to compile a source file including e.g. the `idna.h' header file, you must make sure that the compiler can find it in the directory hierarchy. This is accomplished by adding the path to the directory in which the header file is located to the compiler's include file search path (via the -I option).
However, the path to the include file is determined at the time the source is configured. To solve this problem, `Libidn' uses the external package pkg-config that knows the path to the include file and other configuration options. The options that need to be added to the compiler invocation at compile time are output by the --cflags option to pkg-config libidn. The following example shows how it can be used at the command line:
gcc -c foo.c `pkg-config libidn --cflags`
Adding the output of ‘pkg-config libidn --cflags’ to the compiler's command line will ensure that the compiler can find e.g. the idna.h header file.
A similar problem occurs when linking the program with the library. Again, the compiler has to find the library files. For this to work, the path to the library files has to be added to the library search path (via the -L option). For this, the option --libs to pkg-config libidn can be used. For convenience, this option also outputs all other options that are required to link the program with the `libidn' library. The example shows how to link foo.o with the `libidn' library to a program foo.
gcc -o foo foo.o `pkg-config libidn --libs`
Of course you can also combine both examples to a single command by specifying both options to pkg-config:
gcc -o foo foo.c `pkg-config libidn --cflags --libs`
If your project uses Autoconf (see GNU Autoconf)
to check for installed libraries, you might find the following snippet
illustrative. It adds a new configure parameter
--with-libidn, checks for idna.h and ‘-lidn’
(possibly below the directory specified as the optional argument to
--with-libidn), and defines the CPP symbol
LIBIDN if the library is found. The default behaviour is to
search for the library and enable the functionality (that is, define
the symbol) when the library is found, but if you wish to make the
default behaviour of your package be that Libidn is not used (even if
it is installed on the system), change ‘libidn=yes’ to
‘libidn=no’ on the third line.
AC_ARG_WITH(libidn, AC_HELP_STRING([--with-libidn=[DIR]],
                                   [Support IDN (needs GNU Libidn)]),
  libidn=$withval, libidn=yes)
if test "$libidn" != "no"; then
  if test "$libidn" != "yes"; then
    LDFLAGS="${LDFLAGS} -L$libidn/lib"
    CPPFLAGS="${CPPFLAGS} -I$libidn/include"
  fi
  AC_CHECK_HEADER(idna.h,
    AC_CHECK_LIB(idn, stringprep_check_version,
      [libidn=yes LIBS="${LIBS} -lidn"], libidn=no),
    libidn=no)
fi
if test "$libidn" != "no" ; then
  AC_DEFINE(LIBIDN, 1, [Define to 1 if you want IDN support.])
else
  AC_MSG_WARN([Libidn not found])
fi
AC_MSG_CHECKING([if Libidn should be used])
AC_MSG_RESULT($libidn)
If you require that your users have installed pkg-config (which
I cannot recommend generally), the above can be done more easily as
follows.
AC_ARG_WITH(libidn, AC_HELP_STRING([--with-libidn=[DIR]],
                                   [Support IDN (needs GNU Libidn)]),
  libidn=$withval, libidn=yes)
if test "$libidn" != "no" ; then
  PKG_CHECK_MODULES(LIBIDN, libidn >= 0.0.0, [libidn=yes], [libidn=no])
  if test "$libidn" != "yes" ; then
    libidn=no
    AC_MSG_WARN([Libidn not found])
  else
    libidn=yes
    AC_DEFINE(LIBIDN, 1, [Define to 1 if you want Libidn.])
  fi
fi
AC_MSG_CHECKING([if Libidn should be used])
AC_MSG_RESULT($libidn)
The rest of this library makes extensive use of Unicode characters. In order to interface this library with the outside world, your application may need to make various Unicode transformations.
stringprep.h
To use the functions explained in this chapter, you need to include the file stringprep.h using:
#include <stringprep.h>
c: an ISO 10646 character code.
outbuf: output buffer, must have at least 6 bytes of space. If NULL, the length will be computed and returned and nothing will be written to outbuf.
Converts a single character to UTF-8.
Return value: number of bytes written.
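The byte layout behind a single-character conversion of this kind can be sketched in plain C. This is a minimal illustration of the UTF-8 encoding rule (up to 6 bytes, which is why the output buffer needs 6 bytes of space), not Libidn's source; the function name sketch_unichar_to_utf8 is hypothetical, and the sketch assumes c is below 0x80000000.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, not Libidn's code: encode code point c as
   UTF-8.  Writes 1..6 bytes to outbuf (if non-NULL) and returns the
   number of bytes the encoding takes. */
int
sketch_unichar_to_utf8 (uint32_t c, char *outbuf)
{
  unsigned char first;		/* marker bits of the lead byte */
  int len, i;

  if (c < 0x80)           { first = 0x00; len = 1; }
  else if (c < 0x800)     { first = 0xC0; len = 2; }
  else if (c < 0x10000)   { first = 0xE0; len = 3; }
  else if (c < 0x200000)  { first = 0xF0; len = 4; }
  else if (c < 0x4000000) { first = 0xF8; len = 5; }
  else                    { first = 0xFC; len = 6; }

  if (outbuf != NULL)
    {
      /* Fill continuation bytes from the end, 6 payload bits each. */
      for (i = len - 1; i > 0; --i)
	{
	  outbuf[i] = (char) ((c & 0x3F) | 0x80);
	  c >>= 6;
	}
      /* The remaining high bits go into the lead byte. */
      outbuf[0] = (char) (c | first);
    }
  return len;
}
```

Passing NULL as outbuf only computes the length, mirroring the behaviour described for the outbuf parameter.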
p: a pointer to a Unicode character encoded as UTF-8.
Converts a sequence of bytes encoded as UTF-8 to a Unicode character. If p does not point to a valid UTF-8 encoded character, results are undefined.
Return value: the resulting character.
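The reverse direction can be sketched the same way. The following is an illustrative decoder, not Libidn's source; the name sketch_utf8_to_unichar is hypothetical. Like the function described above, it performs no validation, so it is only meaningful for well-formed input.

```c
#include <stdint.h>

/* Hypothetical stand-in, not Libidn's code: decode the UTF-8 sequence
   starting at p into one code point.  No validation is performed. */
uint32_t
sketch_utf8_to_unichar (const char *p)
{
  const unsigned char *s = (const unsigned char *) p;
  uint32_t c;
  int extra, i;			/* extra = number of continuation bytes */

  if (s[0] < 0x80)                { c = s[0];        extra = 0; }
  else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; extra = 1; }
  else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; extra = 2; }
  else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; extra = 3; }
  else if ((s[0] & 0xFC) == 0xF8) { c = s[0] & 0x03; extra = 4; }
  else                            { c = s[0] & 0x01; extra = 5; }

  /* Each continuation byte contributes its low 6 bits. */
  for (i = 1; i <= extra; ++i)
    c = (c << 6) | (s[i] & 0x3F);

  return c;
}
```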
str: a UCS-4 encoded string.
len: the maximum length of str to use. If len < 0, then the string is terminated with a 0 character.
items_read: location to store number of characters read, or NULL.
items_written: location to store number of bytes written, or NULL. The value stored here does not include the trailing 0 byte.
Convert a string from a 32-bit fixed width representation as UCS-4 to UTF-8. The result will be terminated with a 0 byte.
Return value: a pointer to a newly allocated UTF-8 string. This value must be freed with free(). If an error occurs, NULL will be returned.
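To make the length and ownership conventions of a UCS-4 to UTF-8 string conversion concrete, here is a self-contained sketch. It is not Libidn's implementation: the name sketch_ucs4_to_utf8 and the use of long for len are assumptions made for illustration, and the sketch assumes that conversion also stops at a 0 code point when len is non-negative.

```c
#include <stdint.h>
#include <stdlib.h>

/* Bytes needed to encode one code point as UTF-8 (1..6). */
static size_t
utf8_length (uint32_t c)
{
  return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3
    : c < 0x200000 ? 4 : c < 0x4000000 ? 5 : 6;
}

/* Hypothetical stand-in, not Libidn's code.  Returns a malloc'ed,
   0-terminated UTF-8 string that the caller must free(), or NULL
   if memory allocation fails. */
char *
sketch_ucs4_to_utf8 (const uint32_t *str, long len,
		     size_t *items_read, size_t *items_written)
{
  static const unsigned char lead[] =
    { 0, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
  size_t n = 0, bytes = 0, i, k;
  char *result, *p;

  /* Use at most len code points; stop early at a 0 code point. */
  while ((len < 0 || n < (size_t) len) && str[n] != 0)
    bytes += utf8_length (str[n++]);

  result = malloc (bytes + 1);
  if (result == NULL)
    return NULL;

  for (p = result, i = 0; i < n; ++i)
    {
      uint32_t c = str[i];
      size_t l = utf8_length (c);

      for (k = l - 1; k > 0; --k)
	{
	  p[k] = (char) ((c & 0x3F) | 0x80);
	  c >>= 6;
	}
      p[0] = (char) (c | lead[l]);
      p += l;
    }
  *p = '\0';			/* trailing 0 byte, not counted below */

  if (items_read)
    *items_read = n;
  if (items_written)
    *items_written = bytes;
  return result;
}
```

The byte count reported through items_written excludes the terminator, matching the convention stated for that parameter.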
str: a UTF-8 encoded string.
len: the maximum length of str to use. If len < 0, then the string is nul-terminated.
items_written: location to store the number of characters in the result, or NULL.
Convert a string from UTF-8 to a 32-bit fixed width representation as UCS-4, assuming valid UTF-8 input. This function does no error checking on the input.
Return value: a pointer to a newly allocated UCS-4 string. This value must be freed with free().
str: a Unicode string.
len: length of the str array, or -1 if str is nul-terminated.
Converts a UCS-4 string into UTF-8 and runs stringprep_utf8_nfkc_normalize().
Return value: a newly allocated Unicode string, that is the NFKC normalized form of str.
str: a UTF-8 encoded string.
len: length of str, in bytes, or -1 if str is nul-terminated.
Converts a string into canonical form, standardizing such issues as whether a character with an accent is represented as a base character and combining accent or as a single precomposed character.
The normalization mode is NFKC (ALL COMPOSE). It standardizes differences that do not affect the text content, such as the above-mentioned accent representation. It standardizes the "compatibility" characters in Unicode, such as SUPERSCRIPT THREE to the standard forms (in this case DIGIT THREE). Formatting information may be lost but for most text operations such characters should be considered the same. It returns a result with composed forms rather than a maximally decomposed form.
Return value: a newly allocated string, that is the NFKC normalized form of str.
Find out the current locale charset. The function respects the CHARSET environment variable, but typically uses nl_langinfo(CODESET) when it is supported. It falls back on "ASCII" if CHARSET isn't set and nl_langinfo isn't supported or doesn't return anything.
Note that this function returns the application's locale's preferred charset (or the thread's locale's preferred charset, if your system supports thread-specific locales). It does not return what the system may be using. Thus, if you receive data from external sources you cannot in general use this function to guess what charset it is encoded in. Use stringprep_convert from the external representation into the charset returned by this function, to have data in the locale encoding.
Return value: Returns the character set used by the current locale. It will never return NULL, but uses "ASCII" as a fallback.
str: input zero-terminated string.
to_codeset: name of destination character set.
from_codeset: name of origin character set, as used by str.
Convert the string from one character set to another using the system's iconv() function.
Return value: Returns a newly allocated zero-terminated string which is str transcoded into to_codeset.
str: input zero terminated string.
Convert a string encoded in the locale's character set into UTF-8 by using stringprep_convert().
Return value: Returns a newly allocated zero-terminated string which is str transcoded into UTF-8.
str: input zero terminated string.
Convert a string encoded in UTF-8 into the locale's character set by using stringprep_convert().
Return value: Returns a newly allocated zero-terminated string which is str transcoded into the locale's character set.
Stringprep describes a framework for preparing Unicode text strings in order to increase the likelihood that string input and string comparison work in ways that make sense for typical users throughout the world. The stringprep protocol is useful for protocol identifier values, company and personal names, internationalized domain names, and other text strings.
stringprep.h
To use the functions explained in this chapter, you need to include the file stringprep.h using:
#include <stringprep.h>
Further types and structures are defined for applications that want to specify their own stringprep profile. As these are fairly obscure, and by necessity tied to the implementation, we do not document them here. Look into the stringprep.h header file, and the profiles.c source code for the details.
Disable the NFKC normalization, as well as selecting the non-NFKC case folding tables. Usually the profile specifies BIDI and NFKC settings, and applications should not override it unless in special situations.
Disable the BIDI step. Usually the profile specifies BIDI and NFKC settings, and applications should not override it unless in special situations.
Make the library return with an error if string contains unassigned characters according to profile.
ucs4: input/output array with string to prepare.
len: on input, length of input array with Unicode code points; on exit, length of output array with Unicode code points.
maxucs4len: maximum length of input/output array.
flags: a Stringprep_profile_flags value, or 0.
profile: pointer to Stringprep_profile to use.
Prepare the input UCS-4 string according to the stringprep profile, and write back the result to the input string.
The input is not required to be zero terminated (ucs4[len] = 0). The output will not be zero terminated unless ucs4[len] = 0. Instead, see stringprep_4zi() if your input is zero terminated or if you want the output to be.
Since the stringprep operation can expand the string, maxucs4len indicates how large the buffer holding the string is. This function will not read or write to code points outside that size.
The flags are one of the Stringprep_profile_flags values, or 0.
The profile contains the Stringprep_profile instructions to perform. Your application can define new profiles, possibly re-using the generic stringprep tables that always will be part of the library, or use one of the currently supported profiles.
Return value: Returns STRINGPREP_OK iff successful, or a Stringprep_rc error code.
ucs4: input/output array with zero terminated string to prepare.
maxucs4len: maximum length of input/output array.
flags: a Stringprep_profile_flags value, or 0.
profile: pointer to Stringprep_profile to use.
Prepare the input zero terminated UCS-4 string according to the stringprep profile, and write back the result to the input string.
Since the stringprep operation can expand the string, maxucs4len indicates how large the buffer holding the string is. This function will not read or write to code points outside that size.
The flags are one of the Stringprep_profile_flags values, or 0.
The profile contains the Stringprep_profile instructions to perform. Your application can define new profiles, possibly re-using the generic stringprep tables that always will be part of the library, or use one of the currently supported profiles.
Return value: Returns STRINGPREP_OK iff successful, or a Stringprep_rc error code.
in: input/output array with string to prepare.
maxlen: maximum length of input/output array.
flags: a Stringprep_profile_flags value, or 0.
profile: pointer to Stringprep_profile to use.
Prepare the input zero terminated UTF-8 string according to the stringprep profile, and write back the result to the input string.
Note that you must convert strings entered in the system's locale into UTF-8 before using this function, see stringprep_locale_to_utf8().
Since the stringprep operation can expand the string, maxlen indicates how large the buffer holding the string is. This function will not read or write to characters outside that size.
The flags are one of the Stringprep_profile_flags values, or 0.
The profile contains the Stringprep_profile instructions to perform. Your application can define new profiles, possibly re-using the generic stringprep tables that always will be part of the library, or use one of the currently supported profiles.
Return value: Returns STRINGPREP_OK iff successful, or an error code.
in: input array with UTF-8 string to prepare.
out: output variable with pointer to newly allocated string.
profile: name of stringprep profile to use.
flags: a Stringprep_profile_flags value, or 0.
Prepare the input zero terminated UTF-8 string according to the stringprep profile, and return the result in a newly allocated variable.
Note that you must convert strings entered in the system's locale into UTF-8 before using this function, see stringprep_locale_to_utf8().
The output out variable must be deallocated by the caller.
The flags are one of the Stringprep_profile_flags values, or 0.
The profile specifies the name of the stringprep profile to use. It must be one of the internally supported stringprep profiles.
Return value: Returns STRINGPREP_OK iff successful, or an error code.
rc: a Stringprep_rc return code.
Convert a return code integer to a text string. This string can be used to output a diagnostic message to the user.
STRINGPREP_OK: Successful operation. This value is guaranteed to always be zero, the remaining ones are only guaranteed to hold non-zero values, for logical comparison purposes.
STRINGPREP_CONTAINS_UNASSIGNED: String contains unassigned Unicode code points, which is forbidden by the profile.
STRINGPREP_CONTAINS_PROHIBITED: String contains code points prohibited by the profile.
STRINGPREP_BIDI_BOTH_L_AND_RAL: String contains code points with conflicting bidirectional category.
STRINGPREP_BIDI_LEADTRAIL_NOT_RAL: Leading and trailing character in string not of proper bidirectional category.
STRINGPREP_BIDI_CONTAINS_PROHIBITED: Contains prohibited code points detected by bidirectional code.
STRINGPREP_TOO_SMALL_BUFFER: Buffer handed to function was too small. This usually indicates a problem in the calling application.
STRINGPREP_PROFILE_ERROR: The stringprep profile was inconsistent. This usually indicates an internal error in the library.
STRINGPREP_FLAG_ERROR: The supplied flag conflicted with profile. This usually indicates a problem in the calling application.
STRINGPREP_UNKNOWN_PROFILE: The supplied profile name was not known to the library.
STRINGPREP_NFKC_FAILED: The Unicode NFKC operation failed. This usually indicates an internal error in the library.
STRINGPREP_MALLOC_ERROR: Memory allocation with malloc() failed. This is usually a fatal error.
Return value: Returns a pointer to a statically allocated string containing a description of the error with the return code rc.
in: input/output array with string to prepare.
maxlen: maximum length of input/output array.
Prepare the input UTF-8 string according to the nameprep profile. The AllowUnassigned flag is false; use stringprep_nameprep for true AllowUnassigned. Returns 0 iff successful, or an error code.
in: input/output array with string to prepare.
maxlen: maximum length of input/output array.
Prepare the input UTF-8 string according to the draft iSCSI stringprep profile. Returns 0 iff successful, or an error code.
in: input/output array with string to prepare.
maxlen: maximum length of input/output array.
Prepare the input UTF-8 string according to the draft SASL ANONYMOUS profile. Returns 0 iff successful, or an error code.
in: input/output array with string to prepare.
maxlen: maximum length of input/output array.
Prepare the input UTF-8 string according to the draft XMPP node identifier profile. Returns 0 iff successful, or an error code.
in: input/output array with string to prepare.
maxlen: maximum length of input/output array.
Prepare the input UTF-8 string according to the draft XMPP resource identifier profile. Returns 0 iff successful, or an error code.
Punycode is a simple and efficient transfer encoding syntax designed for use with Internationalized Domain Names in Applications. It uniquely and reversibly transforms a Unicode string into an ASCII string. ASCII characters in the Unicode string are represented literally, and non-ASCII characters are represented by ASCII characters that are allowed in host name labels (letters, digits, and hyphens). A general algorithm called Bootstring allows a string of basic code points to uniquely represent any string of code points drawn from a larger set. Punycode is an instance of Bootstring that uses particular parameter values, appropriate for IDNA.
punycode.h
To use the functions explained in this chapter, you need to include the file punycode.h using:
#include <punycode.h>
The punycode function uses a special type to denote Unicode code points. It is guaranteed to always be a 32 bit unsigned integer.
An unsigned integer that holds Unicode code points.
Note that the current implementation will fail if the
input_length exceeds 4294967295 (the maximum value of a
punycode_uint). This restriction may be removed in the future.
Meanwhile applications are encouraged to not depend on this problem,
and use sizeof to initialize input_length and
output_length.
The functions provided are the following two entry points:
Function: punycode_encode
input_length: The number of code points in the input array and the number of flags in the case_flags array.
input: An array of code points. They are presumed to be Unicode code points, but that is not strictly REQUIRED. The array contains code points, not code units. UTF-16 uses code units D800 through DFFF to refer to code points 10000..10FFFF. The code points D800..DFFF do not occur in any valid Unicode string. The code points that can occur in Unicode strings (0..D7FF and E000..10FFFF) are also called Unicode scalar values.
case_flags: A NULL pointer or an array of boolean values parallel to the input array. Nonzero (true, flagged) suggests that the corresponding Unicode character be forced to uppercase after being decoded (if possible), and zero (false, unflagged) suggests that it be forced to lowercase (if possible). ASCII code points (0..7F) are encoded literally, except that ASCII letters are forced to uppercase or lowercase according to the corresponding case flags. If case_flags is a NULL pointer then ASCII letters are left as they are, and other code points are treated as unflagged.
output_length: The caller passes in the maximum number of ASCII code points that it can receive. On successful return it will contain the number of ASCII code points actually output.
output: An array of ASCII code points. It is *not* null-terminated; it will contain zeros if and only if the input contains zeros. (Of course the caller can leave room for a terminator and add one if needed.)
Converts a sequence of code points (presumed to be Unicode code points) to Punycode.
Return value: The return value can be any of the Punycode_status values listed in this chapter except PUNYCODE_BAD_INPUT. If not PUNYCODE_SUCCESS, then output_length and output might contain garbage.
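The distinction between code points, UTF-16 code units and Unicode scalar values described for the input parameter can be made concrete with a small sketch. The helper names below (is_unicode_scalar_value and combine_surrogates) are hypothetical and are not part of Libidn; they only restate the ranges given above in plain C.

```c
#include <stdint.h>

/* Hypothetical helper, not part of Libidn: returns 1 if cp is a
   Unicode scalar value (0..D7FF or E000..10FFFF), 0 otherwise. */
int
is_unicode_scalar_value (uint32_t cp)
{
  return cp <= 0xD7FF || (cp >= 0xE000 && cp <= 0x10FFFF);
}

/* Hypothetical helper, not part of Libidn: combines a UTF-16 surrogate
   pair (two code units) into the single code point in 10000..10FFFF
   that it refers to.  Returns 0 on success, or -1 if the two units do
   not form a high/low surrogate pair. */
int
combine_surrogates (uint16_t high, uint16_t low, uint32_t *cp)
{
  if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF)
    return -1;
  *cp = 0x10000 + (((uint32_t) (high - 0xD800)) << 10) + (low - 0xDC00);
  return 0;
}
```

An application that receives UTF-16 text would need a step like combine_surrogates before filling the input array, since the array is expected to hold code points rather than code units.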
Function: punycode_decode
input_length: The number of ASCII code points in the input array.
input: An array of ASCII code points (0..7F).
output_length: The caller passes in the maximum number of code points that it can receive into the output array (which is also the maximum number of flags that it can receive into the case_flags array, if case_flags is not a NULL pointer). On successful return it will contain the number of code points actually output (which is also the number of flags actually output, if case_flags is not a NULL pointer). The decoder will never need to output more code points than the number of ASCII code points in the input, because of the way the encoding is defined. The number of code points output cannot exceed the maximum possible value of a punycode_uint, even if the supplied output_length is greater than that.
output: An array of code points like the input argument of punycode_encode() (see above).
case_flags: A NULL pointer (if the flags are not needed by the caller) or an array of boolean values parallel to the output array. Nonzero (true, flagged) suggests that the corresponding Unicode character be forced to uppercase by the caller (if possible), and zero (false, unflagged) suggests that it be forced to lowercase (if possible). ASCII code points (0..7F) are output already in the proper case, but their flags will be set appropriately so that applying the flags would be harmless.
Converts Punycode to a sequence of code points (presumed to be Unicode code points).
Return value: The return value can be any of the Punycode_status values listed in this chapter. If not PUNYCODE_SUCCESS, then output_length, output, and case_flags might contain garbage.
Function: punycode_strerror
rc: a Punycode_status return code.
Convert a return code integer to a text string. This string can be used to output a diagnostic message to the user.
PUNYCODE_SUCCESS: Successful operation. This value is guaranteed to always be zero, the remaining ones are only guaranteed to hold non-zero values, for logical comparison purposes.
PUNYCODE_BAD_INPUT: Input is invalid.
PUNYCODE_BIG_OUTPUT: Output would exceed the space provided.
PUNYCODE_OVERFLOW: Input needs wider integers to process.
Return value: Returns a pointer to a statically allocated string containing a description of the error with the return code rc.
Until now, there has been no standard method for domain names to use characters outside the ASCII repertoire. The IDNA document defines internationalized domain names (IDNs) and a mechanism called IDNA for handling them in a standard fashion. IDNs use characters drawn from a large repertoire (Unicode), but IDNA allows the non-ASCII characters to be represented using only the ASCII characters already allowed in so-called host names today. This backward-compatible representation is required in existing protocols like DNS, so that IDNs can be introduced with no changes to the existing infrastructure. IDNA is only meant for processing domain names, not free text.
To use the functions explained in this chapter, you need to include the file idna.h using:
#include <idna.h>
The IDNA flags parameter can take on the following values, or a
bit-wise inclusive or of any subset of them:
IDNA_ALLOW_UNASSIGNED: Allow unassigned Unicode code points.
IDNA_USE_STD3_ASCII_RULES: Check output to make sure it is a STD3 conforming host name.
The idea behind the IDNA function names is as follows: the
idna_to_ascii_4i and idna_to_unicode_44i functions are
the core IDNA primitives. The 4 indicates that the function
takes UCS-4 strings (i.e., Unicode code points encoded in a 32-bit
unsigned integer type) of the specified length. The i indicates
that the data is written “inline” into the buffer. This means the
caller is responsible for allocating (and deallocating) the string,
and providing the library with the allocated length of the string.
The output length is written in the output length variable. The
remaining functions all contain the z indicator, which means
the strings are zero terminated. All output strings are allocated by
the library, and must be deallocated by the caller. The 4
indicator again means that the string is UCS-4, the 8 means the
strings are UTF-8 and the l indicator means the strings are
encoded in the encoding used by the current locale.
The functions provided are the following entry points:
Function: idna_to_ascii_4i
in: input array with Unicode code points.
inlen: length of input array with Unicode code points.
out: output zero terminated string that must have room for at least 63 characters plus the terminating zero.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
The ToASCII operation takes a sequence of Unicode code points that make up one domain label and transforms it into a sequence of code points in the ASCII range (0..7F). If ToASCII succeeds, the original sequence and the resulting sequence are equivalent labels.
It is important to note that the ToASCII operation can fail. ToASCII fails if any step of it fails. If any step of the ToASCII operation fails on any label in a domain name, that domain name MUST NOT be used as an internationalized domain name. The method for dealing with this failure is application-specific.
The inputs to ToASCII are a sequence of code points, the AllowUnassigned flag, and the UseSTD3ASCIIRules flag. The output of ToASCII is either a sequence of ASCII code points or a failure condition.
ToASCII never alters a sequence of code points that are all in the ASCII range to begin with (although it could fail). Applying the ToASCII operation multiple times has exactly the same effect as applying it just once.
Return value: Returns 0 on success, or an Idna_rc error code.
Function: idna_to_unicode_44i
in: input array with Unicode code points.
inlen: length of input array with Unicode code points.
out: output array with Unicode code points.
outlen: on input, maximum size of output array with Unicode code points; on exit, actual size of output array with Unicode code points.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
The ToUnicode operation takes a sequence of Unicode code points that make up one domain label and returns a sequence of Unicode code points. If the input sequence is a label in ACE form, then the result is an equivalent internationalized label that is not in ACE form, otherwise the original sequence is returned unaltered.
ToUnicode never fails. If any step fails, then the original input sequence is returned immediately in that step.
The Punycode decoder can never output more code points than it inputs, but Nameprep can, and therefore ToUnicode can. Note that the number of octets needed to represent a sequence of code points depends on the particular character encoding used.
The inputs to ToUnicode are a sequence of code points, the AllowUnassigned flag, and the UseSTD3ASCIIRules flag. The output of ToUnicode is always a sequence of Unicode code points.
Return value: Returns an Idna_rc error condition, but it must only be used for debugging purposes. The output buffer is always guaranteed to contain the correct data according to the specification (sans malloc induced errors). NB! This means that you normally ignore the return code from this function, as checking it means breaking the standard.
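Whether a label is "in ACE form" starts with a prefix test. The sketch below assumes the ACE prefix "xn--" from RFC 3490, which this chapter does not spell out, and compares it case-insensitively. The function name has_ace_prefix is hypothetical and is not part of Libidn; a label that passes this test may still fail the later ToUnicode steps and be returned unaltered.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Libidn: returns 1 if the label
   (len code points) starts with the ACE prefix "xn--", compared
   case-insensitively, and 0 otherwise.  Assumes an ASCII-compatible
   execution character set. */
int
has_ace_prefix (const uint32_t *label, size_t len)
{
  static const char prefix[] = "xn--";
  size_t i;

  if (len < 4)
    return 0;
  for (i = 0; i < 4; i++)
    {
      uint32_t c = label[i];

      /* Fold ASCII uppercase letters to lowercase before comparing. */
      if (c >= 'A' && c <= 'Z')
        c += 'a' - 'A';
      if (c != (uint32_t) prefix[i])
        return 0;
    }
  return 1;
}
```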
Function: idna_to_ascii_4z
input: zero terminated input Unicode string.
output: pointer to newly allocated output string.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert UCS-4 domain name to ASCII string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or error code.
Function: idna_to_ascii_8z
input: zero terminated input UTF-8 string.
output: pointer to newly allocated output string.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert UTF-8 domain name to ASCII string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or error code.
Function: idna_to_ascii_lz
input: zero terminated input string encoded in the current locale's character set.
output: pointer to newly allocated output string.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert domain name in the locale's encoding to ASCII string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or error code.
Function: idna_to_unicode_4z4z
input: zero-terminated Unicode string.
output: pointer to newly allocated output Unicode string.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert possibly ACE encoded domain name in UCS-4 format into a UCS-4 string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or error code.
Function: idna_to_unicode_8z4z
input: zero-terminated UTF-8 string.
output: pointer to newly allocated output Unicode string.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert possibly ACE encoded domain name in UTF-8 format into a UCS-4 string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or error code.
Function: idna_to_unicode_8z8z
input: zero-terminated UTF-8 string.
output: pointer to newly allocated output UTF-8 string.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert possibly ACE encoded domain name in UTF-8 format into a UTF-8 string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or error code.
Function: idna_to_unicode_8zlz
input: zero-terminated UTF-8 string.
output: pointer to newly allocated output string encoded in the current locale's character set.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert possibly ACE encoded domain name in UTF-8 format into a string encoded in the current locale's character set. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or error code.
Function: idna_to_unicode_lzlz
input: zero-terminated string encoded in the current locale's character set.
output: pointer to newly allocated output string encoded in the current locale's character set.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert possibly ACE encoded domain name in the locale's character set into a string encoded in the current locale's character set. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or error code.
Function: idna_strerror
rc: an Idna_rc return code.
Convert a return code integer to a text string. This string can be used to output a diagnostic message to the user.
IDNA_SUCCESS: Successful operation. This value is guaranteed to always be zero, the remaining ones are only guaranteed to hold non-zero values, for logical comparison purposes.
IDNA_STRINGPREP_ERROR: Error during string preparation.
IDNA_PUNYCODE_ERROR: Error during punycode operation.
IDNA_CONTAINS_NON_LDH: For IDNA_USE_STD3_ASCII_RULES, indicate that the string contains non-LDH ASCII characters.
IDNA_CONTAINS_MINUS: For IDNA_USE_STD3_ASCII_RULES, indicate that the string contains a leading or trailing hyphen-minus (U+002D).
IDNA_INVALID_LENGTH: The final output string is not within the (inclusive) range 1 to 63 characters.
IDNA_NO_ACE_PREFIX: The string does not contain the ACE prefix (for ToUnicode).
IDNA_ROUNDTRIP_VERIFY_ERROR: The ToASCII operation on output string does not equal the input.
IDNA_CONTAINS_ACE_PREFIX: The input contains the ACE prefix (for ToASCII).
IDNA_ICONV_ERROR: Could not convert string in locale encoding.
IDNA_MALLOC_ERROR: Could not allocate buffer (this is typically a fatal error).
IDNA_DLOPEN_ERROR: Could not dlopen the libcidn DSO (only used internally in libc).
Return value: Returns a pointer to a statically allocated string containing a description of the error with the return code rc.
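Three of the error codes above (IDNA_CONTAINS_NON_LDH, IDNA_CONTAINS_MINUS and IDNA_INVALID_LENGTH) describe simple conditions on an ASCII label. The following is a minimal sketch of those three conditions in plain C. The enum and the function sketch_check_std3 are hypothetical stand-ins and are not part of Libidn, and the order in which the conditions are tested is a choice made for this sketch rather than the order used by ToASCII.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch only. */
enum sketch_rc
{
  SKETCH_OK = 0,
  SKETCH_NON_LDH,   /* a character other than letter, digit, hyphen */
  SKETCH_MINUS,     /* leading or trailing hyphen-minus */
  SKETCH_LENGTH     /* length outside the range 1 to 63 */
};

/* Hypothetical helper, not part of Libidn: checks one zero terminated
   ASCII label against the three conditions described above. */
int
sketch_check_std3 (const char *label)
{
  size_t len = strlen (label);
  size_t i;

  if (len < 1 || len > 63)
    return SKETCH_LENGTH;
  for (i = 0; i < len; i++)
    {
      unsigned char c = (unsigned char) label[i];
      int ldh = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9') || c == '-';

      if (!ldh)
        return SKETCH_NON_LDH;
    }
  if (label[0] == '-' || label[len - 1] == '-')
    return SKETCH_MINUS;
  return SKETCH_OK;
}
```

In a real application these checks are performed by the library when IDNA_USE_STD3_ASCII_RULES is passed in flags; the sketch only illustrates what the error codes mean.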
Organizations that manage some Top Level Domains (TLDs) have published tables of the characters they accept within the domain. The reason may be to reduce the complexity that comes from using the full Unicode range, and to protect themselves from future (backwards incompatible) changes in the IDN or Unicode specifications. Libidn implements an infrastructure for defining and checking strings against such tables. Libidn also ships some tables from TLDs that have given us permission to use them. Because these tables are even less static than Unicode or Stringprep tables, it is likely that they will be updated from time to time (even in backwards incompatible ways). The Libidn interface provides a “version” field for each TLD table, which can be compared for equality to guarantee the same operation over time.
From a design point of view, you can regard the TLD tables for IDN as the “localization” step that comes after the “internationalization” step provided by the IETF standards.
The TLD functionality relies on up-to-date tables. The latest version of Libidn aims to provide these, but tables with unclear copying conditions, or generally experimental tables, are not included. Some such tables can be found at http://tldchk.berlios.de.
To use the functions explained in this chapter, you need to include the file tld.h using:
#include <tld.h>
Function: tld_check_4t
in: Array of Unicode code points to process. Does not need to be zero terminated.
inlen: Number of Unicode code points.
errpos: Position of offending character is returned here.
tld: A Tld_table data structure representing the restrictions for which the input should be tested.
Test whether each of the code points in in is allowed by the data structure in tld, and return in errpos the position of the first character that is not.
Return value: Returns the Tld_rc value TLD_SUCCESS if all code points are valid or when tld is null, TLD_INVALID if a character is not allowed, or additional error codes on general failure conditions.
Function: tld_check_4tz
in: Zero terminated array of Unicode code points to process.
errpos: Position of offending character is returned here.
tld: A Tld_table data structure representing the restrictions for which the input should be tested.
Test whether each of the code points in in is allowed by the data structure in tld, and return in errpos the position of the first character that is not.
Return value: Returns the Tld_rc value TLD_SUCCESS if all code points are valid or when tld is null, TLD_INVALID if a character is not allowed, or additional error codes on general failure conditions.
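The "test each code point and report the first offender" behaviour described above can be sketched independently of the library. The code below assumes, purely for illustration, that a table is a list of inclusive code point ranges; struct cp_range and sketch_check are hypothetical names and do not reproduce the layout of Tld_table or the logic of the Libidn functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a TLD table entry: one inclusive range of
   allowed code points. */
struct cp_range
{
  uint32_t start;
  uint32_t end;
};

/* Returns 1 if cp falls inside any of the nranges ranges. */
static int
in_ranges (uint32_t cp, const struct cp_range *ranges, size_t nranges)
{
  size_t i;

  for (i = 0; i < nranges; i++)
    if (cp >= ranges[i].start && cp <= ranges[i].end)
      return 1;
  return 0;
}

/* Hypothetical helper: returns 0 if every code point in in[0..inlen-1]
   is allowed.  Otherwise returns -1 and, if errpos is not NULL, stores
   the index of the first code point that is not allowed. */
int
sketch_check (const uint32_t *in, size_t inlen,
              const struct cp_range *ranges, size_t nranges,
              size_t *errpos)
{
  size_t i;

  for (i = 0; i < inlen; i++)
    if (!in_ranges (in[i], ranges, nranges))
      {
        if (errpos)
          *errpos = i;
        return -1;
      }
  return 0;
}
```

The position reported is an index into the code point array, which is what makes it directly usable for pointing at the offending character in a UCS-4 string.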
Function: tld_get_4
in: Array of Unicode code points to process. Does not need to be zero terminated.
inlen: Number of Unicode code points.
out: Zero terminated ASCII result string pointer.
Isolate the top-level domain of in and return it as an ASCII string in out.
Return value: Return TLD_SUCCESS on success, or the corresponding Tld_rc error code otherwise.
Function: tld_get_4z
in: Zero terminated array of Unicode code points to process.
out: Zero terminated ASCII result string pointer.
Isolate the top-level domain of in and return it as an ASCII string in out.
Return value: Return TLD_SUCCESS on success, or the corresponding Tld_rc error code otherwise.
Function: tld_get_z
in: Zero terminated character array to process.
out: Zero terminated ASCII result string pointer.
Isolate the top-level domain of in and return it as an ASCII string in out. The input string in may be UTF-8, ISO-8859-1 or any ASCII compatible character encoding.
Return value: Return TLD_SUCCESS on success, or the corresponding Tld_rc error code otherwise.
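To illustrate what "isolating the top-level domain" means for a zero terminated string, here is a minimal sketch in plain C. The function sketch_get_tld is hypothetical and is not the Libidn implementation; it assumes that the TLD is the run of ASCII letters after the last '.' and that the result should be lowercased, both of which are assumptions made for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Libidn: returns a newly allocated,
   lowercased copy of the ASCII letters that follow the last '.' in
   domain.  Returns NULL if there is no dot, if nothing follows it, if
   a non-letter follows it, or if allocation fails.  The caller frees
   the result. */
char *
sketch_get_tld (const char *domain)
{
  const char *dot = strrchr (domain, '.');
  const char *tld;
  size_t len, i;
  char *out;

  if (dot == NULL || dot[1] == '\0')
    return NULL;
  tld = dot + 1;
  len = strlen (tld);
  out = malloc (len + 1);
  if (out == NULL)
    return NULL;
  for (i = 0; i < len; i++)
    {
      char c = tld[i];

      if (c >= 'A' && c <= 'Z')
        c = (char) (c - 'A' + 'a');   /* fold to lowercase */
      else if (!(c >= 'a' && c <= 'z'))
        {
          free (out);
          return NULL;
        }
      out[i] = c;
    }
  out[len] = '\0';
  return out;
}
```

As with the library functions, the caller owns the returned string and must free it.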
Function: tld_get_table
tld: TLD name (e.g. "com") as zero terminated ASCII byte string.
tables: Zero terminated array of Tld_table info-structures for TLDs.
Get the TLD table for a named TLD by searching through the given TLD table array.
Return value: Return structure corresponding to TLD tld by going through tables, or return NULL if no such structure is found.
Function: tld_default_table
tld: TLD name (e.g. "com") as zero terminated ASCII byte string.
overrides: Additional zero terminated array of Tld_table info-structures for TLDs, or NULL to only use library default tables.
Get the TLD table for a named TLD, using the internal defaults, possibly overridden by the (optional) supplied tables.
Return value: Return structure corresponding to TLD tld, first looking through overrides then through the built-in list, or NULL if no such structure is found.
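The lookup order described above (overrides first, then the built-in list, first match wins) can be sketched as follows. struct sketch_table and both functions are hypothetical simplifications for illustration; they are not the Tld_table type or the Libidn functions, and the built-in list is passed explicitly here only so that the sketch is self-contained.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified table: a TLD name plus a version string. */
struct sketch_table
{
  const char *name;
  const char *version;
};

/* Searches a NULL terminated array of table pointers and returns the
   first entry whose name equals tld, or NULL if none matches. */
const struct sketch_table *
sketch_get_table (const char *tld, const struct sketch_table **tables)
{
  size_t i;

  if (tld == NULL || tables == NULL)
    return NULL;
  for (i = 0; tables[i] != NULL; i++)
    if (strcmp (tables[i]->name, tld) == 0)
      return tables[i];
  return NULL;
}

/* Looks in overrides first and falls back to the built-in list. */
const struct sketch_table *
sketch_default_table (const char *tld,
                      const struct sketch_table **overrides,
                      const struct sketch_table **builtin)
{
  const struct sketch_table *t = sketch_get_table (tld, overrides);

  return t != NULL ? t : sketch_get_table (tld, builtin);
}
```

Comparing the version field of the table that was found against a stored value is one way to notice that a table has changed between runs.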
Function: tld_check_4
in: Array of Unicode code points to process. Does not need to be zero terminated.
inlen: Number of Unicode code points.
errpos: Position of offending character is returned here.
overrides: A Tld_table array of additional domain restriction structures that complement and supersede the built-in information.
Test whether each of the code points in in is allowed by the information in overrides or by the built-in TLD restriction data. When data for the same TLD is available both internally and in overrides, the information in overrides takes precedence. If several entries for a specific TLD are found, the first one is used. If overrides is NULL, only the built-in information is used. The position of the first offending character is returned in errpos.
Return value: Returns the Tld_rc value TLD_SUCCESS if all code points are valid or when no table is found for the TLD, TLD_INVALID if a character is not allowed, or additional error codes on general failure conditions.
Function: tld_check_4z
in: Zero-terminated array of Unicode code points to process.
errpos: Position of offending character is returned here.
overrides: A Tld_table array of additional domain restriction structures that complement and supersede the built-in information.
Test whether each of the code points in in is allowed by the information in overrides or by the built-in TLD restriction data. When data for the same TLD is available both internally and in overrides, the information in overrides takes precedence. If several entries for a specific TLD are found, the first one is used. If overrides is NULL, only the built-in information is used. The position of the first offending character is returned in errpos.
Return value: Returns the Tld_rc value TLD_SUCCESS if all code points are valid or when no table is found for the TLD, TLD_INVALID if a character is not allowed, or additional error codes on general failure conditions.
Function: tld_check_8z
in: Zero-terminated UTF-8 string to process.
errpos: Position of offending character is returned here.
overrides: A Tld_table array of additional domain restriction structures that complement and supersede the built-in information.
Test whether each of the characters in in is allowed by the information in overrides or by the built-in TLD restriction data. When data for the same TLD is available both internally and in overrides, the information in overrides takes precedence. If several entries for a specific TLD are found, the first one is used. If overrides is NULL, only the built-in information is used. The position of the first offending character is returned in errpos. Note that the error position refers to the decoded character offset rather than the byte position in the string.
Return value: Returns the Tld_rc value TLD_SUCCESS if all characters are valid or when no table is found for the TLD, TLD_INVALID if a character is not allowed, or additional error codes on general failure conditions.
Function: tld_check_lz
in: Zero-terminated string in the current locale's encoding to process.
errpos: Position of offending character is returned here.
overrides: A Tld_table array of additional domain restriction structures that complement and supersede the built-in information.
Test whether each of the characters in in is allowed by the information in overrides or by the built-in TLD restriction data. When data for the same TLD is available both internally and in overrides, the information in overrides takes precedence. If several entries for a specific TLD are found, the first one is used. If overrides is NULL, only the built-in information is used. The position of the first offending character is returned in errpos. Note that the error position refers to the decoded character offset rather than the byte position in the string.
Return value: Returns the Tld_rc value TLD_SUCCESS if all characters are valid or when no table is found for the TLD, TLD_INVALID if a character is not allowed, or additional error codes on general failure conditions.
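Because the error position for UTF-8 input is a character offset, an application that wants to point at the offending byte in the original string has to convert it. A minimal sketch of that conversion follows; utf8_char_to_byte_offset is a hypothetical helper, not part of Libidn, and it assumes the string is valid UTF-8. It does not apply to strings in other locale encodings.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Libidn: returns the byte offset at
   which the character with index char_pos starts in the valid UTF-8
   string s.  If s has fewer characters, the offset of the terminating
   zero is returned. */
size_t
utf8_char_to_byte_offset (const char *s, size_t char_pos)
{
  size_t byte = 0;
  size_t chars = 0;

  while (s[byte] != '\0')
    {
      /* Bytes of the form 10xxxxxx continue the previous character;
         every other byte starts a new one. */
      if (((unsigned char) s[byte] & 0xC0) != 0x80)
        {
          if (chars == char_pos)
            return byte;
          chars++;
        }
      byte++;
    }
  return byte;
}
```

For example, in a string where the second character takes two bytes, character offset 2 corresponds to byte offset 3.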
Function: tld_strerror
rc: a Tld_rc return code.
Convert a return code integer to a text string. This string can be used to output a diagnostic message to the user.
TLD_SUCCESS: Successful operation. This value is guaranteed to always be zero, the remaining ones are only guaranteed to hold non-zero values, for logical comparison purposes.
TLD_INVALID: Invalid character found.
TLD_NODATA: No input data was provided.
TLD_MALLOC_ERROR: Error during memory allocation.
TLD_ICONV_ERROR: Error during iconv string conversion.
TLD_NO_TLD: No top-level domain found in domain string.
Return value: Returns a pointer to a statically allocated string containing a description of the error with the return code rc.
A deficiency in the specification of Unicode Normalization Forms has been found. The consequence is that some strings can be normalized into different strings by different implementations. In other words, two different implementations may return different output for the same input (because the interpretation of the specification is ambiguous). Further, an implementation invoked again on one of the output strings may return a different string (because one of the interpretations of the ambiguous specification makes normalization non-idempotent). Fortunately, only a select few character sequences exhibit this problem, and none of them are expected to occur in natural languages (due to different linguistic uses of the involved characters).
A full discussion of the problem may be found at:
http://www.unicode.org/review/pr-29.html
The PR29 functions below allow you to detect the problem sequences. So when would you want to use these functions? For most applications, such as those using Nameprep for IDN, this is likely only to be an interoperability problem. Thus, you may not want to care about it, as the character sequences will rarely occur naturally. However, if you are using a profile, such as SASLPrep, to process authentication tokens, authorization tokens, or passwords, there is a real danger that attackers may try to use the peculiarities in these strings to attack parts of your system. As only a small number of strings, and no naturally occurring strings, exhibit this problem, the conservative approach of rejecting the strings is recommended. If this approach is not used, you should instead verify that all parts of your system that process the tokens and passwords use an NFKC implementation that produces the same output for the same input.
Technically inclined readers may be interested in knowing more about the implementation aspects of the PR29 flaw. See PR29 discussion.
To use the functions explained in this chapter, you need to include the file pr29.h using:
#include <pr29.h>
Function: pr29_4
in: input array with Unicode code points.
len: length of input array with Unicode code points.
Check the input to see if it may be normalized into different strings by different NFKC implementations, due to an anomaly in the NFKC specifications.
Return value: Returns the Pr29_rc value PR29_SUCCESS on success, and PR29_PROBLEM if the input sequence is a "problem sequence" (i.e., may be normalized into different strings by different implementations).
Function: pr29_4z
in: zero terminated array of Unicode code points.
Check the input to see if it may be normalized into different strings by different NFKC implementations, due to an anomaly in the NFKC specifications.
Return value: Returns the Pr29_rc value PR29_SUCCESS on success, and PR29_PROBLEM if the input sequence is a "problem sequence" (i.e., may be normalized into different strings by different implementations).
Function: pr29_8z
in: zero terminated input UTF-8 string.
Check the input to see if it may be normalized into different strings by different NFKC implementations, due to an anomaly in the NFKC specifications.
Return value: Returns the Pr29_rc value PR29_SUCCESS on success, and PR29_PROBLEM if the input sequence is a "problem sequence" (i.e., may be normalized into different strings by different implementations), or PR29_STRINGPREP_ERROR if there was a problem converting the string from UTF-8 to UCS-4.
Function: pr29_strerror
rc: a Pr29_rc return code.
Convert a return code integer to a text string. This string can be used to output a diagnostic message to the user.
PR29_SUCCESS: Successful operation. This value is guaranteed to always be zero, the remaining ones are only guaranteed to hold non-zero values, for logical comparison purposes.
PR29_PROBLEM: A problem sequence was encountered.
PR29_STRINGPREP_ERROR: The character set conversion failed (only for pr29_8() and pr29_8z()).
Return value: Returns a pointer to a statically allocated string containing a description of the error with the return code rc.
This chapter contains example code that illustrates how Libidn can be used when writing your own application.
This example demonstrates how the stringprep functions are used.
/* example.c --- Example code showing how to use stringprep().
* Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 Simon Josefsson
*
* This file is part of GNU Libidn.
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h> /* setlocale() */
#include <stringprep.h>
/*
* Compiling using libtool and pkg-config is recommended:
*
* $ libtool cc -o example example.c `pkg-config --cflags --libs libidn`
* $ ./example
* Input string encoded as `ISO-8859-1': ª
* Before locale2utf8 (length 2): aa 0a
* Before stringprep (length 3): c2 aa 0a
* After stringprep (length 2): 61 0a
* $
*
*/
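int
main (int argc, char *argv[])
{
  char buf[BUFSIZ];
  char *p;
  int rc;
  size_t i;

  setlocale (LC_ALL, "");

  printf ("Input string encoded as `%s': ", stringprep_locale_charset ());
  fflush (stdout);
  /* Stop early if nothing could be read, so buf is never used unset. */
  if (!fgets (buf, BUFSIZ, stdin))
    {
      fprintf (stderr, "Could not read input\n");
      return EXIT_FAILURE;
    }

  /* strlen() returns size_t, so cast it before printing with %lu. */
  printf ("Before locale2utf8 (length %lu): ", (unsigned long) strlen (buf));
  for (i = 0; i < strlen (buf); i++)
    printf ("%02x ", buf[i] & 0xFF);
  printf ("\n");

  p = stringprep_locale_to_utf8 (buf);
  if (p)
    {
      /* The UTF-8 form may be longer than the input; copy only if it fits. */
      if (strlen (p) < BUFSIZ)
        strcpy (buf, p);
      else
        printf ("UTF-8 string does not fit in buffer, continuing anyway...\n");
      free (p);
    }
  else
    printf ("Could not convert string to UTF-8, continuing anyway...\n");

  printf ("Before stringprep (length %lu): ", (unsigned long) strlen (buf));
  for (i = 0; i < strlen (buf); i++)
    printf ("%02x ", buf[i] & 0xFF);
  printf ("\n");

  /* Run the Nameprep profile in place on the UTF-8 string. */
  rc = stringprep (buf, BUFSIZ, 0, stringprep_nameprep);
  if (rc != STRINGPREP_OK)
    printf ("Stringprep failed (%d): %s\n", rc, stringprep_strerror (rc));
  else
    {
      printf ("After stringprep (length %lu): ", (unsigned long) strlen (buf));
      for (i = 0; i < strlen (buf); i++)
        printf ("%02x ", buf[i] & 0xFF);
      printf ("\n");
    }

  return 0;
}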
This example demonstrates how the punycode functions are used.
/* example2.c --- Example code showing how to use punycode.
* Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007 Simon Josefsson
* Copyright (C) 2002 Adam M. Costello
*
* This file is part of GNU Libidn.
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*
*/
#include <locale.h> /* setlocale() */
/*
* This file is derived from RFC 3492 written by Adam M. Costello.
*
* Disclaimer and license: Regarding this entire document or any
* portion of it (including the pseudocode and C code), the author
* makes no guarantees and is not responsible for any damage resulting
* from its use. The author grants irrevocable permission to anyone
* to use, modify, and distribute it in any way that does not diminish
* the rights of anyone else to use, modify, and distribute it,
* provided that redistributed derivative works do not contain
* misleading author or version information. Derivative works need
* not be licensed under similar terms.
*
*/
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <punycode.h>
/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a */
/* command-line option. */
enum
{
  unicode_max_length = 256,
  ace_max_length = 256
};

static void
usage (char **argv)
{
  fprintf (stderr,
           "\n"
           "%s -e reads code points and writes a Punycode string.\n"
           "%s -d reads a Punycode string and writes code points.\n"
           "\n"
           "Input and output are plain text in the native character set.\n"
           "Code points are in the form u+hex separated by whitespace.\n"
           "Although the specification allows Punycode strings to contain\n"
           "any characters from the ASCII repertoire, this test code\n"
           "supports only the printable characters, and needs the Punycode\n"
           "string to be followed by a newline.\n"
           "The case of the u in u+hex is the force-to-uppercase flag.\n",
           argv[0], argv[0]);
  exit (EXIT_FAILURE);
}

static void
fail (const char *msg)
{
  fputs (msg, stderr);
  exit (EXIT_FAILURE);
}

static const char too_big[] =
  "input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char overflow[] = "arithmetic overflow\n";
static const char io_error[] = "I/O error\n";
/* The following string is used to convert printable */
/* characters between ASCII and the native charset: */
static const char print_ascii[] =
  "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
  "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
  " !\"#$%&'()*+,-./"
  "0123456789:;<=>?"
  "\x40"                        /* at sign; a single character, so that
                                   each entry stays at its ASCII index */
  "ABCDEFGHIJKLMNO"
  "PQRSTUVWXYZ[\\]^_"
  "`abcdefghijklmno"
  "pqrstuvwxyz{|}~\n";
int
main (int argc, char **argv)
{
  enum punycode_status status;
  int r;
  size_t input_length, output_length, j;
  unsigned char case_flags[unicode_max_length];

  setlocale (LC_ALL, "");

  if (argc != 2)
    usage (argv);
  if (argv[1][0] != '-')
    usage (argv);
  if (argv[1][2] != 0)
    usage (argv);

  if (argv[1][1] == 'e')
    {
      uint32_t input[unicode_max_length];
      unsigned long codept;
      char output[ace_max_length + 1], uplus[3];
      int c;

      /* Read the input code points: */
      input_length = 0;
      for (;;)
        {
          r = scanf ("%2s%lx", uplus, &codept);
          if (ferror (stdin))
            fail (io_error);
          if (r == EOF || r == 0)
            break;
          if (r != 2 || uplus[1] != '+' || codept > (uint32_t) - 1)
            {
              fail (invalid_input);
            }
          if (input_length == unicode_max_length)
            fail (too_big);
          if (uplus[0] == 'u')
            case_flags[input_length] = 0;
          else if (uplus[0] == 'U')
            case_flags[input_length] = 1;
          else
            fail (invalid_input);
          input[input_length++] = codept;
        }

      /* Encode: */
      output_length = ace_max_length;
      status = punycode_encode (input_length, input, case_flags,
                                &output_length, output);
      if (status == punycode_bad_input)
        fail (invalid_input);
      if (status == punycode_big_output)
        fail (too_big);
if (status