18.1.1 Handling Multibyte and Varying-Width Characters

diff, diff3 and sdiff treat each line of input as a string of unibyte characters, that is, they assume that every byte is a complete character. This can mishandle multibyte characters in some cases. For example, when asked to ignore spaces, diff does not properly ignore a multibyte space character.
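The white space case can be illustrated with a minimal sketch in Python. It is not diff's code: the helper names (strip_spaces_bytewise, strip_spaces_charwise) are hypothetical, UTF-8 input is assumed, and U+3000 IDEOGRAPHIC SPACE is chosen only as one example of a multibyte space character. The sketch compares a byte-oriented filter with one that decodes the line first.

```python
# Illustrative sketch only; this is not how diff is implemented.
ASCII_SPACE_BYTES = b" \t\n\v\f\r"


def strip_spaces_bytewise(line: bytes) -> bytes:
    """Drop every byte that is an ASCII white space character."""
    return bytes(byte for byte in line if byte not in ASCII_SPACE_BYTES)


def strip_spaces_charwise(line: bytes, encoding: str = "utf-8") -> bytes:
    """Decode first, then drop every character Unicode treats as white space."""
    text = line.decode(encoding)
    return "".join(ch for ch in text if not ch.isspace()).encode(encoding)


line_ascii = "foo bar".encode("utf-8")       # ASCII space: one byte
line_wide = "foo\u3000bar".encode("utf-8")   # U+3000: three bytes (E3 80 80)

# The byte-oriented filter leaves the three bytes of U+3000 in place,
# so the two lines still differ after "ignoring" spaces.
bytewise_equal = strip_spaces_bytewise(line_ascii) == strip_spaces_bytewise(line_wide)

# The character-oriented filter removes both kinds of space.
charwise_equal = strip_spaces_charwise(line_ascii) == strip_spaces_charwise(line_wide)

print(bytewise_equal, charwise_equal)
```

Under these assumptions the byte-oriented comparison reports the lines as different, while the character-oriented comparison reports them as equal. The decoding step is also where the extra cost comes from.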

Also, diff currently assumes that each byte is one column wide, and this assumption is incorrect in some locales, e.g., locales that use UTF-8 encoding. This causes problems with the -y or --side-by-side option of diff, which depends on column widths to lay out the two files next to each other.
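The gap between byte count and column count can be sketched as follows. This is a rough approximation in Python, not diff's code: the function names are hypothetical, UTF-8 is assumed, and the width rule (two columns for East Asian wide and fullwidth characters, zero for combining marks, one otherwise) is a simplification of what a terminal does. A C program would more likely rely on a locale-aware function such as wcwidth.

```python
import unicodedata


def byte_columns(text: str, encoding: str = "utf-8") -> int:
    """Width under the one-byte-equals-one-column assumption."""
    return len(text.encode(encoding))


def display_columns(text: str) -> int:
    """Approximate display width of text in terminal columns."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no column of their own
        # East Asian wide (W) and fullwidth (F) characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


# ASCII text, a Latin letter with an accent, and two CJK ideographs.
samples = ["abc", "\u00e9", "\u65e5\u672c"]
for sample in samples:
    print(repr(sample), byte_columns(sample), display_columns(sample))
```

With these samples the two measures agree only for the ASCII string. A side-by-side layout that pads or truncates by byte count would therefore place the gutter in the wrong column for the other two.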

These problems need to be fixed without unduly affecting the performance of the utilities in unibyte environments.

The IBM GNU/Linux Technology Center Internationalization Team has proposed patches to support internationalized diff. Unfortunately, these patches are incomplete and apply to an older version of diff, so more work needs to be done in this area.