The Unicode HOWTO

Bruno Haible, <haible@clisp.cons.org>
v1.0, 23 January 2001

This document describes how to change your Linux system so it uses UTF-8 as text encoding. This is work in progress. Any tips, patches, pointers, URLs are very welcome.
1. Introduction
2. Display setup
3. Locale setup
4. Specific applications
5. Printing
6. Making your programs Unicode aware
7. Other sources of information

1. Introduction
1.1 Why Unicode?
People in different countries use different characters to represent the words of their native languages. Nowadays most applications, including email systems and web browsers, are 8-bit clean, i.e. they can operate on and display text correctly provided that it is represented in an 8-bit character set, like ISO-8859-1. There are far more than 256 characters in the world - think of cyrillic, hebrew, arabic, chinese, japanese, korean and thai -, and new characters are being invented now and then. The problems that come up for users are:
The solution to this problem is the adoption of a world-wide usable character set. This character set is Unicode http://www.unicode.org/. For more info about Unicode, do `man 7 unicode' (a manpage contained in the man-pages package).
1.2 Unicode encodings
This reduces the user's problem of dealing with character sets to a technical problem: how to transport Unicode characters using 8-bit bytes? 8-bit units are the smallest addressing units of most computers and also the unit used by TCP/IP network connections. The use of 1 byte to represent 1 character is, however, an accident of history, caused by the fact that computer development started in Europe and the U.S., where 96 characters were found to be sufficient for a long time. There are basically four ways to encode Unicode characters in bytes: UCS-2 (every character encoded as two bytes), UTF-16 (like UCS-2, with surrogate pairs for characters outside the 16-bit range), UCS-4 (every character encoded as four bytes), and UTF-8 (a variable-length encoding in which ASCII characters remain single bytes).
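To see the variable-length property of UTF-8 in action, here is a hedged one-liner (it assumes a shell and terminal already running in a UTF-8 environment):

   printf 'é' | od -An -tx1
   # prints:  c3 a9   (U+00E9 occupies the two bytes 0xC3 0xA9)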
The space requirements for encoding a text, compared to encodings currently in use (8 bits per character for European languages, more for Chinese/Japanese/Korean), are as follows. This has an influence on disk storage space and network download speed (when no form of compression is used).
Given the penalty for US and European documents caused by UCS-2, UTF-16, and UCS-4, it seems unlikely that these encodings have a potential for wide-scale use. The Microsoft Win32 API supports the UCS-2 encoding since 1995 (at least), yet this encoding has not been widely adopted for documents - SJIS remains prevalent in Japan. UTF-8 on the other hand has the potential for wide-scale use, since it doesn't penalize US and European users, and since many text processing programs don't need to be changed for UTF-8 support. In the following, we will describe how to change your Linux system so it uses UTF-8 as text encoding.
Footnotes for C/C++ developers
The Microsoft Win32 approach makes it easy for developers to produce Unicode versions of their programs: you "#define UNICODE" at the top of your program and then change many occurrences of `char' to `TCHAR', until the program compiles without warnings.

Moreover, there is an endianness issue with UCS-2 and UCS-4. The IANA character set registry http://www.isi.edu/in-notes/iana/assignments/character-sets says about ISO-10646-UCS-2: "this needs to specify network byte order: the standard does not specify". Network byte order is big endian. And RFC 2152 is even clearer: "ISO/IEC 10646-1:1993(E) specifies that when characters in the UCS-2 form are serialized as octets, that the most significant octet appear first." Microsoft, on the other hand, in its C/C++ development tools, recommends using machine-dependent endianness (i.e. little endian on ix86 processors) and either a byte-order mark at the beginning of the document, or some statistical heuristics(!). The UTF-8 approach, by contrast, keeps `char *' as the string type, and since a UTF-8 stream is a plain byte sequence, it has no endianness issue at all.
1.3 Related resources
Markus Kuhn's very up-to-date resource list: http://www.cl.cam.ac.uk/~mgk25/unicode.html

Roman Czyborra's overview of Unicode, UTF-8 and UTF-8 aware programs: http://czyborra.com/utf/#UTF-8

Some example UTF-8 files, such as the utf-8-demo.txt and x-utf8.html files referenced in the lynx section below.
2. Display setup
We assume you have already adapted your Linux console and X11 configuration to your keyboard and locale. This is explained in the Danish/International HOWTO, and in the other national HOWTOs: Finnish, French, German, Italian, Polish, Slovenian, Spanish, Cyrillic, Hebrew, Chinese, Thai, Esperanto. But please do not follow the advice given in the Thai HOWTO to pretend you are using ISO-8859-1 characters (U0000..U00FF) when what you are actually typing are Thai characters (U0E01..U0E5B). Doing so will only cause problems when you switch to Unicode.
2.1 Linux console
I'm not talking much about the Linux console here, because on those machines on which I don't have xdm running, I use it only to type my login name, my password, and "xinit".

Anyway, the kbd-0.99 package ftp://sunsite.unc.edu/pub/Linux/system/keyboards/kbd-0.99.tar.gz and a heavily extended version, the console-tools-0.2.3 package ftp://sunsite.unc.edu/pub/Linux/system/keyboards/console-tools-0.2.3.tar.gz, contain in the kbd-0.99/src/ (or console-tools-0.2.3/screenfonttools/) directory two programs: `unicode_start' and `unicode_stop'. When you call `unicode_start', the console's screen output is interpreted as UTF-8. Also, the keyboard is put into Unicode mode (see "man kbd_mode"). In this mode, Unicode characters typed as Alt-x1 ... Alt-xn (where x1,...,xn are digits on the numeric keypad) will be emitted in UTF-8. If your keyboard or, more precisely, your normal keymap has non-ASCII letter keys (like the German umlauts) which you would like to be CapsLockable, you need to apply the kernel patch linux-2.2.9-keyboard.diff or linux-2.3.12-keyboard.diff.

You will want to display characters from different scripts on the same screen. For this, you need a Unicode console font. The ftp://sunsite.unc.edu/pub/Linux/system/keyboards/kbd-0.99.tar.gz and ftp://sunsite.unc.edu/pub/Linux/system/keyboards/console-data-1999.08.29.tar.gz packages contain a font (LatArCyrHeb-{08,14,16,19}.psf) which covers the Latin, Cyrillic, Hebrew and Arabic scripts. It covers ISO 8859 parts 1,2,3,4,5,6,8,9,10 all at once. To install it, copy it to /usr/lib/kbd/consolefonts/ and execute "/usr/bin/setfont /usr/lib/kbd/consolefonts/LatArCyrHeb-14.psf".

A more flexible approach is given by Dmitry Yu. Bolkhovityanov <D.Yu.Bolkhovityanov@inp.nsk.su> in http://www.inp.nsk.su/~bolkhov/files/fonts/univga/index.html and http://www.inp.nsk.su/~bolkhov/files/fonts/univga/uni-vga.tgz. To work around the constraint that a VGA font can only cover 512 characters simultaneously, he provides a rich Unicode font (2279 characters, covering Latin, Greek, Cyrillic, Hebrew, Armenian, IPA, math symbols, arrows, and more) in the typical 8x16 size, together with a script which can extract any 512 characters as a console font.

If you want cut&paste to work with UTF-8 consoles, you need the patch linux-2.3.12-console.diff from Edmund Thomas Grimley Evans and Stanislav Voronyi. In April 2000, Edmund Thomas Grimley Evans <edmundo@rano.org> implemented an UTF-8 console terminal emulator. It uses Unicode fonts and relies on the Linux frame buffer device.
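Putting these pieces together, a hedged sketch of a console session (the font path corresponds to the installation step above):

   unicode_start      # screen output interpreted as UTF-8, keyboard in Unicode mode
   setfont /usr/lib/kbd/consolefonts/LatArCyrHeb-14.psf
   # ... work with UTF-8 text ...
   unicode_stop       # switch the console back to 8-bit mode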
2.2 X11 Foreign fonts
Don't hesitate to install Cyrillic, Chinese, Japanese etc. fonts. Even if they are not Unicode fonts, they will help in displaying Unicode documents: at least Netscape Communicator 4 and Java will make use of foreign fonts when available. The following programs are useful when installing fonts:
The following fonts are freely available (not a complete list):
2.3 X11 Unicode fonts
Applications wishing to display text belonging to different scripts (like Cyrillic and Greek) at the same time, can do so by using different X fonts for the various pieces of text. This is what Netscape Communicator and Java do. However, this approach is more complicated, because instead of working with `Font' and `XFontStruct', the programmer has to deal with `XFontSet', and also because not all fonts in the font set need to have the same dimensions.
2.4 Unicode xterm
xterm is part of X11R6 and XFree86, but is maintained separately by Tom Dickey. http://www.clark.net/pub/dickey/xterm/xterm.html Newer versions (patch level 146 and above) contain support for converting keystrokes to UTF-8 before sending them to the application running in the xterm, and for displaying Unicode characters that the application outputs as UTF-8 byte sequences. It also contains support for double-wide characters (mostly CJK ideographs) and combining characters, contributed by Robert Brady <robert@suse.co.uk>. To get an UTF-8 xterm running, you need to run it under a UTF-8 locale, with xterm's UTF-8 mode enabled, and with an ISO 10646-1 encoded font; an example follows.
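A hedged example of such an invocation (it assumes an installed en_US.UTF-8 locale and Markus Kuhn's ISO 10646-1 version of the `fixed' font; `-u8' is xterm's UTF-8 switch):

   LC_CTYPE=en_US.UTF-8 xterm -u8 \
     -fn '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'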
2.5 TrueType fonts
The fonts mentioned above are fixed size and not scalable. For some applications, especially printing, high resolution fonts are necessary, though. The most important type of scalable, high resolution fonts are TrueType fonts. They are currently supported by
Some no-cost TrueType fonts with large Unicode coverage are
Download locations for these and other TrueType fonts can be found at Christoph Singer's list of freely downloadable Unicode TrueType fonts http://www.ccss.de/slovo/unifonts.htm. TrueType fonts are installed similarly to fixed size fonts, except that they go in a separate directory, and that the fonts.dir file is created with a TrueType-aware tool such as `ttmkfdir' instead of `mkfontdir'.
TrueType fonts can be converted to low resolution, non-scalable X11 fonts by use of Mark Leisher's ttf2bdf utility ftp://crl.nmsu.edu/CLR/multiling/General/ttf2bdf-2.8-LINUX.tar.gz. For example, to generate a proportional Unicode font for use with cooledit:
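A hedged sketch of such a conversion (the option letters follow ttf2bdf's documented usage, so verify them with `ttf2bdf -h'; the font file name is a placeholder):

   ttf2bdf -p 13 -r 100 -o cyberbit.bdf Cyberbit.ttf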
More information about TrueType fonts can be found in the Linux TrueType HOWTO http://www.moisty.org/~brion/linux/TrueType-HOWTO.html.
2.6 Miscellaneous
A small program which tests whether a Linux console or xterm is in UTF-8 mode can be found in the ftp://sunsite.unc.edu/pub/Linux/system/keyboards/x-lt-1.24.tar.gz package by Ricardas Cepas, files testUTF-8.c and testUTF8.c. Most applications should not use this, however: they should look at the environment variables, see section "Locale environment variables".
3. Locale setup
3.1 Files & the kernel
You can now already use any Unicode characters in file names. No kernel or file utilities need modifications. This is because file names in the kernel can be anything not containing a null byte, and '/' is used to delimit subdirectories. When encoded using UTF-8, non-ASCII characters will never be encoded using null bytes or slashes. All that happens is that file and directory names occupy more bytes than they contain characters. For example, a filename consisting of five greek characters will appear to the kernel as a 10-byte filename. The kernel does not know (and does not need to know) that these bytes are displayed as greek. This is the general theory, as long as your files stay inside Linux. On filesystems which are used from other operating systems, you have mount options to control conversion of filenames to/from UTF-8:
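As a hedged example, for a Windows partition (the exact option names depend on the filesystem type and kernel version; check `man mount' on your system):

   mount -t vfat -o utf8 /dev/hda1 /mnt/windows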
3.2 Upgrading the C library
glibc-2.2 supports multibyte locales, in particular UTF-8 locales. But glibc-2.1.x and earlier C libraries do not support them. Therefore you need to upgrade to glibc-2.2. Upgrading from glibc-2.1.x is risk-free, because glibc-2.2 is binary compatible with glibc-2.1.x (at least on i386 platforms, and except for IPv6). Nevertheless, I recommend having a bootable rescue disk handy in case something goes wrong.

Prepare the kernel sources: you must have them unpacked and configured. /usr/src/linux/include/linux/autoconf.h must exist. Building the kernel is not needed. Retrieve the glibc sources ftp://ftp.gnu.org/pub/gnu/glibc/, su to root, then unpack, build and install it:
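A hedged sketch of such a build (double-check against the INSTALL file that ships with your glibc version before running this as root):

   tar xvfz glibc-2.2.tar.gz
   cd glibc-2.2
   mkdir build
   cd build
   ../configure --prefix=/usr
   make
   make check
   make install
   make localedata/install-locales   # installs the locale support files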
Upgrading from glibc versions earlier than 2.1.x cannot be done this way; consider first installing a Linux distribution based on glibc-2.1.x, and then upgrading to glibc-2.2 as described above. Note that if -- for any reason -- you want to rebuild GCC after having installed glibc-2.2, you need to first apply this patch gcc-glibc-2.2-compat.diff to the GCC sources.
3.3 General data conversion
You will need a program to convert your locally (probably ISO-8859-1) encoded texts to UTF-8. (The alternative would be to keep using texts in different encodings on the same machine; this is not fun in the long run.) One such program is `iconv', which comes with glibc-2.2. Simply use it as in the following example.
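A hedged example, assuming ISO-8859-1 as the source encoding (adapt it to your actual character set):

   iconv --from-code=ISO-8859-1 --to-code=UTF-8 < oldfile > newfile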
Here are two handy shell scripts, called "i2u" i2u.sh (for ISO to UTF conversion) and "u2i" u2i.sh (for UTF to ISO conversion). Adapt according to your current 8-bit character set; a possible sketch of such a wrapper appears at the end of this section.

If you don't have glibc-2.2 and iconv installed, you can use GNU recode 3.6 instead. "i2u" i2u_recode.sh is "recode ISO-8859-1..UTF-8", and "u2i" u2i_recode.sh is "recode UTF-8..ISO-8859-1". ftp://ftp.gnu.org/pub/gnu/recode/recode-3.6.tar.gz

Or you can use CLISP instead. Here are "i2u" i2u.lisp and "u2i" u2i.lisp written in Lisp. Note: you need a CLISP version from July 1999 or newer. ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz

Other data conversion programs, less powerful than GNU recode, are `trans' ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/trans113.tar.gz, `tcs' from the Plan9 operating system ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/tcs.tar.gz, and `utrans'/`uhtrans'/`hutrans' ftp://ftp.cdrom.com/pub/FreeBSD/distfiles/i18ntools-1.0.tar.gz by G. Adam Stanislav <adam@whizkidtech.net>.

For the repeated conversion of files to UTF-8 from different character sets, a semi-automatic tool can be used: to-utf8 presents the non-ASCII parts of a file to the user, lets him decide about the file's original character set, and then converts the file to UTF-8.
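As mentioned above, such a wrapper can be a one-line shell script around iconv; here is a minimal sketch (not the original i2u.sh, and again assuming ISO-8859-1):

   #!/bin/sh
   # i2u: convert ISO-8859-1 input (stdin or the given files) to UTF-8 on stdout
   exec iconv --from-code=ISO-8859-1 --to-code=UTF-8 "$@"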
3.4 Locale environment variables
You may have the following environment variables set, containing locale names: LANGUAGE (a GNU extension, used as an override for messages), LC_ALL (an override for all LC_* categories), the individual LC_* variables (LC_CTYPE, LC_MESSAGES, LC_COLLATE, LC_NUMERIC, LC_MONETARY, LC_TIME), and LANG (the default for any unset LC_* variable). (See `man 7 locale' for a detailed description.)
Each of the LC_* and LANG variables can contain a locale name of the following form:
language[_territory[.codeset]][@modifier]

where language is an ISO 639 language code (lower case), territory is an ISO 3166 country code (upper case), codeset denotes a character set, and modifier stands for other particular attributes (for example indicating a particular language dialect, or a nonstandard orthography). LANGUAGE can contain several locale names, separated by colons. In order to tell your system and all applications that you are using UTF-8, you need to add a codeset suffix of UTF-8 to your locale names. For example, if you were using a German locale such as

   LANG=de_DE

you would change this to

   LANG=de_DE.UTF-8
You do not need to change your LANGUAGE environment variable. GNU gettext in glibc-2.2 has the ability to convert translations to the right encoding.
3.5 Creating the locale support files
You create the support files for each UTF-8 locale using the `localedef' program, which comes with glibc, for example:
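A hedged example for a German UTF-8 locale (the target directory /usr/lib/locale may differ on your system):

   localedef -v -c -i de_DE -f UTF-8 /usr/lib/locale/de_DE.UTF-8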
You typically don't need to create locales named "de" or "fr" without country suffix, because these locales are normally only used by the LANGUAGE variable and not by the LC_* variables, and LANGUAGE is only used as an override for LC_MESSAGES.
4. Specific applications
4.1 Shells
bash
By default, GNU bash assumes that every character is one byte long and one column wide. A patch for bash 2.04 ( bash-2.04-diff), by Marcin 'Qrczak' Kowalczyk and Ricardas Cepas, teaches bash about multibyte characters in UTF-8 encoding. Double-width characters, combining characters and bidi are not supported by this patch; it seems a complete redesign of the readline redisplay engine is needed.
4.2 Networking
telnet
In some installations, telnet is not 8-bit clean by default. In order to be able to send Unicode keystrokes to the remote host, you need to set telnet into "outbinary" mode. There are two ways to do this: on the command line, with the `-L' option, or interactively, with the `set outbinary' command at the telnet prompt.
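A hedged transcript of both ways (the hostname is a placeholder):

   # on the command line:
   telnet -L remote.example.org

   # or interactively at the telnet prompt:
   telnet
   telnet> set outbinary
   telnet> open remote.example.org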
kermit
The communications program C-Kermit http://www.columbia.edu/kermit/ckermit.html (an interactive tool for connection setup, telnet, file transfer, with support for TCP/IP and serial lines), in versions 7.0 or newer, understands UTF-8 and UCS-2 as file and transfer encodings, understands UTF-8 as terminal encoding, and converts between these encodings and many others. Documentation of these features can be found in http://www.columbia.edu/kermit/ckermit2.html#x6.6.
4.3 Browsers
Netscape
Netscape 4.05 or newer can display HTML documents in UTF-8 encoding. All a document needs is the following line between the <head> and </head> tags:

   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Netscape 4.05 or newer can also display HTML and text files in UCS-2 encoding with byte-order mark. http://www.netscape.com/computing/download/
Mozilla
Mozilla milestone M16 has much better internationalization than Netscape 4. It can display HTML documents in UTF-8 encoding with support for more languages. Alas, there is a cosmetic problem with CJK fonts: some glyphs can be bigger than the line's height, thus overlapping the previous or next line.
Amaya
Amaya 4.2.1 ( http://www.w3.org/Amaya/, http://www.w3.org/Amaya/User/SourceDist) now has limited handling of UTF-8 encoded HTML pages. It recognizes the encoding, but it displays only ISO-8859-1 and symbol characters, since those are the only fonts it ever accesses.

Amaya is in fact an HTML editor, not only a browser. Amaya's strengths among the browsers are its speed, given enough memory, and its rendering of mathematical formulas (MathML support).
lynx
lynx-2.8 has an options screen (key 'O') which permits setting the display character set. When running in an xterm or Linux console in UTF-8 mode, set this to "UNICODE UTF-8". Note that for this setting to take effect in the current browser session, you have to confirm it on the "Accept Changes" field, and for it to take effect in future browser sessions, you have to enable the "Save options to disk" field and then confirm on the "Accept Changes" field. Now, again, all a document needs is the following line between the <head> and </head> tags:

   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
When you are viewing text files in UTF-8 encoding, you also need to pass the command-line option "-assume_local_charset=UTF-8" (affects only file:/... URLs) or "-assume_charset=UTF-8" (affects all URLs). In lynx-2.8.2 you can alternatively, in the options screen (key 'O'), change the assumed document character set to "utf-8".

There is also an option in the options screen to set the "preferred document character set". But it has no effect, at least with file:/... URLs and with http://... URLs served by apache-1.3.0.

There is a spacing and line-breaking problem, however. (Look at the russian section of x-utf8.html, or at utf-8-demo.txt.) Also, in lynx-2.8.2, configured with --enable-prettysrc, the nice colour scheme no longer works correctly when the display character set has been set to "UNICODE UTF-8". This is fixed by a simple patch lynx282.diff.

The Lynx developers say: "For any serious use of UTF-8 screen output with lynx, compiling with slang lib and -DSLANG_MBCS_HACK is still recommended."

Latest stable release: ftp://ftp.gnu.org/pub/gnu/lynx/lynx-2.8.2.tar.gz General home page: http://lynx.browser.org/ Newer development snapshots: http://lynx.isc.org/current/, ftp://lynx.isc.org/current/
w3m
w3m by Akinori Ito http://ei5nazha.yz.yamagata-u.ac.jp/~aito/w3m/eng/ is a text mode browser for HTML pages and plain-text files. Its layout of HTML tables, enumerations etc. is much prettier than lynx's. w3m can also be used as a high quality HTML to plain text converter. w3m 0.1.10 has command line options for the three major Japanese encodings, but can also be used for UTF-8 encoded files. Without these options, you often have to press Ctrl-L to refresh the display, and line breaking in Cyrillic and CJK paragraphs is not good. To fix this, Hironori Sakamoto provides a patch http://www2u.biglobe.ne.jp/~hsaka/w3m/ which adds UTF-8 as a display encoding.
Test pages
Some test pages for browsers can be found at the pages of Alan Wood http://www.hclrss.demon.co.uk/unicode/#links and James Kass http://home.att.net/~jameskass/.
4.4 Editors
yudit
yudit by Gáspár Sinai http://www.yudit.org/ is a first-class Unicode text editor for the X Window System. It supports simultaneous processing of many languages, input methods, and conversions for local character standards. It has facilities for entering text in all languages with only an English keyboard, using keyboard configuration maps.
yudit-1.5
It can be compiled in three versions: Xlib GUI, KDE GUI, or Motif GUI. Customization is very easy. Typically you will first customize your font. From the font menu I chose "Unicode". Then, since the command "xlsfonts '*-*-iso10646-1'" still showed some ambiguity, I chose a font size of 13 (to match Markus Kuhn's 13-pixel fixed font). Next, you will customize your input method. The input methods "Straight", "Unicode" and "SGML" are the most remarkable. For details about the other built-in input methods, look in /usr/local/share/yudit/data/. To change the default for the next session, edit your $HOME/.yuditrc file. The general editor functionality is limited to editing, cut&paste and search&replace. No undo.
yudit-2.1
This version is less easy to learn, because it comes with a home-grown GUI and no easily accessible help. But it has undo functionality and should therefore be more usable than version 1.5.
Fonts for yudit
yudit can display text using a TrueType font; see section "TrueType fonts"
above. The Bitstream Cyberbit gives good results. For yudit to find the
font, symlink it to
vim
vim (as of version 6.0r) has good support for UTF-8: when started in an UTF-8 locale, it assumes UTF-8 encoding for the console and the text files being edited. It supports double-wide (CJK) characters as well as combining characters and therefore fits perfectly into an UTF-8 enabled xterm. Installation: download from http://www.vim.org/. After unpacking the four parts, build and install vim in the usual way.

vim can also be used to edit files in other encodings. For example, to edit a BIG5 encoded file, a command like the one sketched below can be used.
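A hedged sketch, using the `++enc' argument documented for vim 6.x (it forces the named encoding when reading the file):

   vim -c 'e ++enc=big5' file.txt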
cooledit
cooledit by Paul Sheer http://www.cooledit.org/ is a good text editor for the X Window System. Since version 3.15, it has support for Unicode, including bidi for Hebrew (but not Arabic). A build error about a missing "vga_setpage" function is worked around by adding "-DDO_NOT_USE_VGALIB" to the CFLAGS. To view UTF-8 files in an UTF-8 locale you have to modify a setting in the "Options -> Switches" panel: enable the checkbox "Display characters outside locale". I also found it necessary to disable "Spellcheck as you type". For viewing texts with both European and CJK characters, cooledit needs a font which contains both, for example the GNU unifont (see section "X11 Unicode fonts"). Start it once with this font;
cooledit will then use this font in all future invocations.
Unfortunately, the only characters that can be entered through the keyboard are ISO-8859-1 characters and, through a cooledit specific compose mechanism, ISO-8859-2 characters. Inputting arbitrary Unicode characters in cooledit is possible, but a bit tedious.
emacs
First of all, you should read the section "International Character Set Support" (node "International") in the Emacs manual. In particular, note that you need to start Emacs using the command

   emacs -fn fontset-standard

so that it will use a font set comprising a lot of international characters.

In the short term, there are two packages for using UTF-8 in Emacs: the emacs-utf package and the Mule-UCS (oc-unicode) package. Neither of them requires recompiling Emacs.
You can use either of these packages, or both together. The advantages of the emacs-utf "unicode-utf8" encoding are: it loads faster, and it deals better with combining characters (important for Thai). The advantages of the Mule-UCS / oc-unicode "utf-8" encoding are: it can apply to a process buffer (such as M-x shell), not only to loading and saving of files; and it respects the widths of characters better (important for Ethiopian). However, it is less reliable: after heavy editing of a file, I have seen some Unicode characters replaced with U+FFFD after the file was saved. (But maybe those were bugs in Emacs 20.5 and 20.6 which are fixed in Emacs 20.7.)

To install the emacs-utf package, compile the program "utf2mule" and install it somewhere in your $PATH, and also install unicode.el, muleuni-1.el, unicode-char.el somewhere. Then add the lines
to your $HOME/.emacs file. To activate any of the font sets, use the Mule
menu item "Set Font/FontSet" or Shift-down-mouse-1. The Unicode coverage
may of the font sets at different sizes may depend on the installed fonts;
here are screen shots at various sizes of UTF-8-demo.txt (12, 13, 14, 15, 16, 18) and of the Mule script examples (12, 13, 14, 15, 16, 18).
To designate a font set as the initial font set for the first frame at startup,
uncomment the set-default-font line in the code snippet above.
To install the oc-unicode package, execute the command
and install the resulting file un-define.elc, as well as oc-unicode.el, oc-charsets.el, oc-tools.el, somewhere. Then add the lines
to your $HOME/.emacs file. You can choose your appropriate font set as with
the emacs-utf package.
In order to open an UTF-8 encoded file, you will type
or
(or utf-8 instead of unicode-utf8, if you prefer oc-unicode/Mule-UCS).
In order to start a shell buffer with UTF-8 I/O, you will type
(This works with oc-unicode/Mule-UCS only.)
There is a newer version, Mule-UCS-0.81. Unfortunately you need to rebuild emacs from source in order to use it. Note that all this works with Emacs 20 in windowing mode only, not in terminal mode. None of the mentioned packages works in Emacs 21, as of this writing. Richard Stallman plans to add integrated UTF-8 support to Emacs in the long term, and so do the XEmacs developers.
xemacs
(This section is written by Gilbert Baumann.) Here is how to teach XEmacs (20.4 configured with MULE) the UTF-8 encoding. Unfortunately you need its sources to be able to patch it. First you need these files provided by Tomohiko Morioka: http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/xemacs-21.0-b55-emc-b55-ucs.diff and http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/xemacs-ucs-conv-0.1.tar.gz The .diff is a diff against the C sources. The tar ball is elisp code, which provides lots of code tables to map to and from Unicode. As the name of the diff file suggests, it is against XEmacs-21; I needed to help `patch' a bit. The most notable difference from my XEmacs-20.4 sources is that file-coding.[ch] was called mule-coding.[ch]. For those unfamiliar with the XEmacs-MULE stuff (as I am), a quick guide: what we call an encoding is called by MULE a `coding-system'. The most important commands are:
and the variable `file-coding-system-alist', which guides `find-file' to guess the encoding used. After stuff was running, the very first thing I did was this. This code looks for the special mode line introduced by -*- somewhere in the first 600 bytes of the file about to be opened; if there is a field "Encoding: xyz;" and the xyz encoding ("coding system" in Emacs speak) exists, it chooses that. So now you could do e.g.
and XEmacs goes into utf-8 mode here. After everything was running, I defined \u03BB (greek lambda) as a macro like:
nedit
xedit
With XFree86-4.0.1, xedit is able to edit UTF-8 files if you set the locale accordingly (see above), and add the line "Xedit*international: true" to your $HOME/.Xdefaults file.
axe
As of version 6.1.2, aXe supports only 8-bit locales. If you add the line "Axe*international: true" to your $HOME/.Xdefaults file, it will simply dump core.
pico
As of version 4.30 of pine, pico cannot reasonably be used to view or edit UTF-8 files. In an UTF-8 enabled xterm, it has severe redraw problems.
mined98
mined98 is a small text editor by Michiel Huisjes, Achim Müller and Thomas Wolff. http://www.inf.fu-berlin.de/~wolff/mined98.tar.gz It lets you edit both 8-bit encoded and UTF-8 encoded files, in an UTF-8 or 8-bit xterm, and it has powerful capabilities for entering Unicode characters. By default it uses an autodetection heuristic to decide whether a file is 8-bit or UTF-8 encoded; if you don't want to rely on heuristics, you can force the encoding with a command-line option. mined knows about double-width and combining characters and displays them correctly. It also has a special display mode for combining characters. mined also has a scrollbar and very nice pull-down menus. Alas, the "Home", "End", "Delete" keys do not work.
qemacs
qemacs 0.2 is a small text editor by Fabrice Bellard http://www-stud.enst.fr/~bellard/qemacs/ with Emacs keybindings. It runs in an UTF-8 console or xterm, and can edit both 8-bit encoded and UTF-8 encoded files. It still has a few rough edges, but further development is underway.
4.5 Mailers
MIME: RFC 2279 defines UTF-8 as a MIME charset, which can be transported under the 8bit, quoted-printable and base64 encodings. The older MIME UTF-7 proposal (RFC 2152) is considered deprecated and should not be used any further. Mail clients released after January 1, 1999, should be capable of sending and displaying UTF-8 encoded mails; otherwise they are considered deficient. But these mails have to carry the MIME labels

   Content-Type: text/plain; charset=UTF-8
   Content-Transfer-Encoding: 8bit
Simply piping an UTF-8 file into "mail" without caring about the MIME labels
will not work.
Mail client implementors should take a look at http://www.imc.org/imc-intl/ and http://www.imc.org/mail-i18n.html. Now about the individual mail clients (or "mail user agents"):
pine
The situation for an unpatched pine version 4.30 is as follows. Pine does not do character set conversions, but it allows you to view UTF-8 mails in an UTF-8 text window (Linux console or xterm). Normally, Pine will warn about different character sets each time you view an UTF-8 encoded mail. To get rid of this warning, choose S (setup), then C (config), then change the value of "character-set" to UTF-8. This option will not do anything except reduce the warnings, as Pine has no built-in knowledge of UTF-8. Also note that Pine's notion of Unicode characters is pretty limited: it will display Latin and Greek characters, but not other kinds of Unicode characters.

A patch by Robert Brady <robert@suse.co.uk> http://www.ents.susu.soton.ac.uk/~robert/pine-utf8-0.1.diff adds UTF-8 support to Pine. With this patch, it decodes and prints headers and bodies properly. The patch depends on the GNOME libunicode http://cvs.gnome.org/lxr/source/libunicode/. However, alignment remains broken in many places; replying to a mail does not cause the character set to be converted as appropriate; and the editor, pico, cannot deal with multibyte characters.
kmail
kmail (as of KDE 1.0) does not support UTF-8 mails at all.
Netscape Communicator
Netscape Communicator's Messenger can send and display mails in UTF-8 encoding, but it needs a little bit of manual user intervention. To send an UTF-8 encoded mail: After opening the "Compose" window, but before starting to compose the message, select from the menu "View -> Character Set -> Unicode (UTF-8)". Then compose the message and send it. When you receive an UTF-8 encoded mail, Netscape unfortunately does not display it in UTF-8 right away, and does not even give a visual clue that the mail was encoded in UTF-8. You have to manually select from the menu "View -> Character Set -> Unicode (UTF-8)". For displaying UTF-8 mails, Netscape uses different fonts. You can adjust your font settings in the "Edit -> Preferences -> Fonts" dialog; choose the "Unicode" font category.
emacs (rmail, vm)
mutt
mutt-1.2.x, as available from http://www.mutt.org/, has only rudimentary support for UTF-8: it can convert from UTF-8 into an 8-bit display charset. The mutt-1.3.x development branch also supports UTF-8 as the display charset (so you can run mutt in an UTF-8 xterm), and has thorough support for MIME and charset conversion (relying on iconv).
exmh
exmh 2.1.2 with Tk 8.4a1 can recognize and correctly display UTF-8 mails
(without CJK characters) if you add the following lines to your
4.6 Text processing
groff
groff 1.16.1, the GNU implementation of the traditional Unix text processing system troff/nroff, can output UTF-8 formatted text. Simply use `groff -Tutf8'.
TeX
The teTeX 0.9 (and newer) distribution contains an Unicode adaptation of TeX, called Omega ( http://www.gutenberg.eu.org/omega/, ftp://ftp.ens.fr/pub/tex/yannis/omega). Together with the unicode.tex file contained in utf8-tex-0.1.tar.gz, it enables you to use UTF-8 encoded sources as input for TeX. About a thousand Unicode characters are currently supported. All that changes is that you run `omega' (instead of `tex') or `lambda' (instead of `latex'), and insert the following lines at the head of your source input:

   \ocp\TexUTF=inutf8
   \InputTranslation currentfile \TexUTF
Other possibly related links: http://www.dante.de/projekte/nts/NTS-FAQ.html, ftp://ftp.dante.de/pub/tex/language/chinese/CJK/.
4.7 Databases
PostgreSQL
PostgreSQL 6.4 or newer can be built with the configuration option --with-mb=UNICODE.
Interbase
Borland/Inprise's Interbase 6.0 can store string fields in UTF-8 format if the option "CHARACTER SET UNICODE_FSS" is given.
4.8 Other text-mode applications
less
With http://www.flash.net/~marknu/less/less-358.tar.gz you can browse UTF-8 encoded text files in an UTF-8 xterm or console. Make sure that the environment variable LESSCHARSET is not set (or is set to utf-8). If you also have a LESSKEY environment variable set, also make sure that the file it points to does not define LESSCHARSET. If necessary, regenerate this file using the `lesskey' command, or unset the LESSKEY environment variable.
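A hedged example session (utf-8-demo.txt stands in for any UTF-8 encoded text file):

   unset LESSKEY                         # in case a lesskey file sets LESSCHARSET
   LESSCHARSET=utf-8 less utf-8-demo.txt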
lv
lv-4.49.3 by Tomio Narita http://www.ff.iij4u.or.jp/~nrt/lv/ is a file viewer with builtin character set converters. To view UTF-8 files in an UTF-8 console, use "lv -Au8". But it can also be used to view files in other CJK encodings in an UTF-8 console. There is a small glitch: lv turns off xterm's cursor and doesn't turn it on again.
expand
Get the GNU textutils-2.0 and apply the patch textutils-2.0.diff, then run configure, add "#define HAVE_FGETWC 1" and "#define HAVE_FPUTWC 1" to config.h, and rebuild.
col, colcrt, colrm, column, rev, ul
Get the util-linux-2.9y package and configure it, then define ENABLE_WIDECHAR in defines.h and change the "#if 0" to "#if 1" in lib/widechar.h. In text-utils/Makefile, modify CFLAGS and LDFLAGS so that they include the directories where libutf8 is installed. Then rebuild.
figlet
figlet 2.2 has an option for UTF-8 input: "figlet -C utf8"
Base utilities
The Li18nux list of commands and utilities that ought to be made interoperable with UTF-8 is as follows. Useful information needs to get added here; I just didn't get around to it yet :-)

As of glibc-2.2, regular expressions only work for 8-bit characters. In an UTF-8 locale, regular expressions that contain non-ASCII characters, or that expect to match a single multibyte character with ".", do not work. This affects all commands and utilities listed below.
4.9 Other X11 applications
Owen Taylor is currently developing a library for rendering multilingual text, called pango. http://www.labs.redhat.com/~otaylor/pango/, http://www.pango.org/.
5. Printing
Since Postscript itself does not support Unicode fonts, the burden of Unicode support in printing is on the program creating the Postscript document, not on the Postscript renderer. The existing Postscript fonts I've seen - .pfa/.pfb/.afm/.pfm/.gsf - support only a small range of glyphs and are not Unicode fonts.
5.1 Printing using TrueType fonts
Both the uniprint and wprint programs produce good printed output for Unicode plain text. They require a TrueType font; see section "TrueType fonts" above. The Bitstream Cyberbit gives good results.
uniprint
The "uniprint" program contained in the yudit package can convert a text
file to Postscript. For uniprint to find the Cyberbit font, symlink it to
wprint
The "wprint" (WorldPrint) program by Eduardo Trapani http://ttt.esperanto.org.uy/programoj/angle/wprint.html postprocesses Postscript output produced by Netscape Communicator or Mozilla from HTML pages or plain text files. The output is nearly perfect; only in Cyrillic paragraphs the line breaking is incorrect: the lines are only about half as wide as they should be.
Comparison
For plain text, uniprint has a better overall layout. On the other hand, only wprint gets Thai output correct.
5.2 Printing using fixed-size fonts
Generally, printing using fixed-size fonts does not give as professional an output as using TrueType fonts.
txtbdf2ps
The txtbdf2ps 0.7 program by Serge Winitzki http://members.linuxstart.com/~winitzki/txtbdf2ps.html converts a plain text file to Postscript, by use of a BDF font. Installation:
Example with a proportional font:
Example with a fixed-width font:
Note: txtbdf2ps does not support combining characters and bidi.
5.3 The classical approach
Another way to print with TrueType fonts is to convert the TrueType font to
a Postscript font using the
TeX, Omega
TODO: CJK, metafont, omega, dvips, odvips, utf8-tex-0.1
DocBook
TODO: db2ps, jadetex
groff -Tps
"groff -Tps" produces Postscript output. Its Postscript output driver supports only a very limited number of Unicode characters (only what Postscript supports by itself).
5.4 No luck with...

Netscape's "Print..."
As of version 4.72, Netscape Communicator cannot correctly print HTML pages in UTF-8 encoding. You really have to use wprint.
Mozilla's "Print..."
As of version M16, printing of HTML pages is apparently not implemented.
html2ps
As of version 1.0b1, the html2ps HTML to Postscript converter does not support UTF-8 encoded HTML pages and has no special treatment of fonts: the generated Postscript uses the standard Postscript fonts.
a2ps
As of version 4.12, a2ps doesn't support printing UTF-8 encoded text.
enscript
As of version 1.6.1, enscript doesn't support printing UTF-8 encoded text. By default, it uses only the standard Postscript fonts, but it can also include a custom Postscript font in the output.
6. Making your programs Unicode aware
6.1 C/C++
The C `char' type is 8 bits wide and will stay 8 bits wide, because it denotes the smallest addressable storage unit.
For normal text handling
The ISO/ANSI C standard contains, in an amendment which was added in 1995, a "wide character" type `wchar_t', together with functions operating on wide characters and wide strings, declared in the headers <wchar.h> and <wctype.h>. Good references for this API are
Advantages of using this API:
Drawbacks of this API:
Portability notes
A `wchar_t' need not be encoded in Unicode; this is platform dependent and sometimes locale dependent. In detail, here is what the Single Unix specification says about the `wchar_t' type. One particular consequence is that in portable programs you shouldn't use
non-ASCII characters in string literals. That means, even though you
know the Unicode double quotation marks have the codes U+201C and U+201D,
you shouldn't write a string literal with those characters embedded in it.

Here is a survey of the portability of the ISO/ANSI C facilities on various Unix flavours.
As a consequence, I recommend using the restartable and multithread-safe wcsr/mbsr functions, forgetting about those systems which don't have them (Irix, HP-UX, AIX), and using the UTF-8 locale plug-in libutf8_plug.so (see below) on those systems which permit you to compile programs which use these wcsr/mbsr functions (Linux, Solaris, OSF/1). Similar advice, given by Sun in http://www.sun.com/software/white-papers/wp-unicode/, section "Internationalized Applications with Unicode", is: To properly internationalize an application, use the following guidelines:
If, for some reason, in some piece of code, you really have to assume that `wchar_t' is Unicode (for example, if you want to do special treatment of some Unicode characters), you should make that piece of code conditional upon the `__STDC_ISO_10646__' preprocessor macro, which is defined exactly when `wchar_t' values are Unicode code points.
The libutf8 library
A portable implementation of the ISO/ANSI C API, which supports 8-bit locales and UTF-8 locales, can be found in libutf8-0.7.3.tar.gz. Advantages:
The Plan9 way
The Plan9 operating system, a variant of Unix, uses UTF-8 as character encoding in all applications. Its wide character type is called `Rune', not `wchar_t'. Drawback of this API:
For graphical user interface
The Qt-2.0 library http://www.troll.no/ contains a fully-Unicode QString class. You can use the member functions QString::utf8 and QString::fromUtf8 to convert to/from UTF-8 encoded text. The QString::ascii and QString::latin1 member functions should not be used any more.
For advanced text handling
The previously mentioned libraries implement Unicode aware versions of the ASCII concepts. Here are libraries which deal with Unicode concepts, such as titlecase (a third letter case, different from uppercase and lowercase), distinction between punctuation and symbols, canonical decomposition, combining classes, canonical ordering and the like.
For conversion
Two kinds of conversion libraries, which support UTF-8 and a large number of 8-bit character sets, are available:
iconv
The iconv implementation by Ulrich Drepper, contained in the GNU glibc-2.2. ftp://ftp.gnu.org/pub/gnu/glibc/glibc-2.2.tar.gz The iconv manpages are now contained in ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz

The portable iconv implementation by Bruno Haible. ftp://ftp.ilog.fr/pub/Users/haible/gnu/libiconv-1.5.1.tar.gz

The portable iconv implementation by Konstantin Chuguev <joy@urc.ac.ru>. ftp://ftp.urc.ac.ru/pub/local/OS/Unix/converters/iconv-0.4.tar.gz

Advantages:
librecode
librecode by François Pinard ftp://ftp.gnu.org/pub/gnu/recode/recode-3.6.tar.gz. Advantages:
Drawbacks:
ICU
International Components for Unicode 1.7 http://oss.software.ibm.com/icu/. IBM's internationalization library also has conversion facilities, declared in `ucnv.h'. Advantages:
Drawbacks:
Other approaches
6.2 Java
Java has Unicode support built into the language. The type `char' denotes a Unicode character, and the `java.lang.String' class denotes a string built up from Unicode characters.

Java can display any Unicode characters through its windowing system AWT, provided that 1. you set the Java system property "user.language" appropriately, 2. the /usr/lib/java/lib/font.properties.language font set definitions are appropriate, and 3. the fonts specified in that file are installed. For example, in order to display text containing japanese characters, you would install japanese fonts and run "java -Duser.language=ja ...". You can combine font sets: in order to display western european, greek and japanese characters simultaneously, you would create a combination of the files "font.properties" (covers ISO-8859-1), "font.properties.el" (covers ISO-8859-7) and "font.properties.ja" into a single file. ??This is untested??

The interfaces java.io.DataInput and java.io.DataOutput have methods called `readUTF' and `writeUTF' respectively. But note that they don't use UTF-8; they use a modified UTF-8 encoding in which the NUL character is encoded as the two-byte sequence 0xC0 0x80 instead of 0x00. Encoded this way, strings can contain NUL characters, and when a terminating 0x00 byte is appended (as the Java Native Interface does), the C <string.h> functions like strlen() and strcpy() can still be used to manipulate them.
6.3 Lisp
The Common Lisp standard specifies two character types: `base-char' and `character'. It's up to the implementation to support Unicode or not. The language also specifies a keyword argument `:external-format' to `open', as the natural place to specify a character set or encoding. Among the free Common Lisp implementations, only CLISP
http://clisp.cons.org/
supports Unicode. You need a CLISP version from March 2000 or newer.
ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz.
The types `base-char' and `character' are both equivalent to 16-bit Unicode.
Among the commercial Common Lisp implementations:

LispWorks http://www.xanalys.com/software_tools/products/ supports Unicode. The type `base-char' is equivalent to ISO-8859-1, and the type `simple-char' (subtype of `character') contains all Unicode characters. The encoding used for file I/O can be specified through the `:external-format' argument, for example '(:UTF-8). Limitations: encodings cannot be used for socket I/O, and the editor cannot edit UTF-8 encoded files.

Eclipse http://www.elwood.com/eclipse/eclipse.htm supports Unicode. See http://www.elwood.com/eclipse/char.htm. The type `base-char' is equivalent to ISO-8859-1, and the type `character' contains all Unicode characters. The encoding used for file I/O can be specified through a combination of the `:element-type' and `:external-format' arguments to `open'. Limitations: character attribute functions are locale dependent. Source and compiled source files cannot contain Unicode string literals.

Allegro CL, in version 6.0, has Unicode support. The types `base-char' and `character' are both equivalent to 16-bit Unicode. The encoding used for file I/O can be specified through the `:external-format' argument.
6.4 Ada95
Ada95 was designed for Unicode support and the Ada95 standard library features special ISO 10646-1 data types Wide_Character and Wide_String, as well as numerous associated procedures and functions. The GNU Ada95 compiler (gnat-3.11 or newer) supports UTF-8 as the external encoding of wide characters. This allows you to use UTF-8 in both source code and application I/O. To activate it in the application, use "WCEM=8" in the FORM string when opening a file, and use compiler option "-gnatW8" if the source code is in UTF-8. See the GNAT ( ftp://cs.nyu.edu/pub/gnat/) and Ada95 ( ftp://ftp.cnam.fr/pub/Ada/PAL/userdocs/docadalt/rm95/index.htm) reference manuals for details.
6.5 Python
Python 2.0 ( http://www.python.org/2.0/, http://www.python.org/pipermail/python-announce-list/2000-October/000889.html, http://starship.python.net/crew/amk/python/writing/new-python/new-python.html) contains Unicode support. It has a new fundamental data type `unicode', representing a Unicode string, a module `unicodedata' for the character properties, and a set of converters for the most important encodings. See http://starship.python.net/crew/lemburg/unicode-proposal.txt, or the file Misc/unicode.txt in the Python distribution.
6.6 JavaScript/ECMAscript
Since JavaScript version 1.3, strings are always Unicode. There is no character type, but you can use the \uXXXX notation for Unicode characters inside strings. No normalization is done internally, so it expects to receive Unicode Normalization Form C, which the W3C recommends. See http://developer.netscape.com/docs/manuals/communicator/jsref/js13.html#Unicode for details and http://developer.netscape.com/docs/javascript/e262-pdf.pdf for the complete ECMAscript specification.
6.7 Tcl
Tcl/Tk started using Unicode as its base character set with version 8.1. Its internal representation for strings is UTF-8. It supports the \uXXXX notation for Unicode characters. See http://dev.scriptics.com/doc/howto/i18n.html.
6.8 Perl
Perl 5.6 stores strings internally in UTF-8 format, if you write

   use utf8;

at the beginning of your script. length() returns the number of characters of a string. For details, see the Perl-i18n FAQ at http://rf.net/~james/perli18n.html.
Support for other (non-8-bit) encodings is available through the iconv interface module http://cpan.perl.org/modules/by-module/Text/Text-Iconv-1.1.tar.gz.
6.9 Related reading
Tomohiro Kubota has written an introduction to internationalization http://www.debian.org/doc/manuals/intro-i18n/. The emphasis of his document is on writing software that runs in any locale, using the locale's encoding.
7. Other sources of information
7.1 Mailing lists
Broader audiences can be reached at the following mailing lists. Note that where I write `at', you should write `@'. (Anti-spam device.)
linux-utf8
Address: This mailing list is about internationalization with Unicode, and covers a broad range of topics from the keyboard driver to the X11 fonts. Archives are at http://mail.nl.linux.org/linux-utf8/. To subscribe, send a message to
li18nux
Address: This mailing list is focused on organizing internationalization work on Linux, and arranging meetings between people. To subscribe, fill in the form at http://www.li18nux.org/
and send it to
unicode
Address: This mailing list is focused on the standardization and continuing development of the Unicode standard, and related technologies, such as Bidi and sorting algorithms. Archives are at ftp://ftp.unicode.org/Public/MailArchive/, but they are not regularly updated. For subscription information, see http://www.unicode.org/unicode/consortium/distlist.html.
X11 internationalization
Address: This mailing list addresses the people who work on better internationalization of the X11/XFree86 system. Archives are at http://devel.xfree86.org/archives/i18n/. To subscribe, send mail to the friendly person at
X11 fonts
Address: This mailing list addresses the people who work on Unicode fonts and the font subsystem for the X11/XFree86 system. Archives are at http://devel.xfree86.org/archives/fonts/. To subscribe, send mail to the overworked person at