Re: unicode (char as abstract data type)

Theodore Y. Ts'o (tytso@MIT.EDU)
Mon, 20 Apr 1998 09:13:40 -0400


Those who prefer charset labeling (such as Alex Belits) are worried
about encoding efficiency (especially Alex, who doesn't like the fact
that Russian didn't get any of the 1-byte UTF-8 assignments). This is
certainly a consideration. However, charset labeling can get
*extremely* complex and messy, especially if you want to store
characters from multiple character sets in the same document or
filesystem.
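
To put numbers on the efficiency concern, here is a minimal sketch of
how many bytes UTF-8 spends per character. The function name utf8_len
is made up for illustration, and the limits follow the present-day
4-byte definition of UTF-8 (the definition current in 1998 also allowed
longer sequences), so treat it as an approximation rather than anything
taken from ext2 or the kernel.

```c
#include <stdint.h>

/*
 * Number of bytes needed to encode one Unicode code point in UTF-8.
 * Returns 0 for values that cannot be encoded (surrogates, or anything
 * above U+10FFFF).  Hypothetical helper, for illustration only.
 */
int utf8_len(uint32_t cp)
{
    if (cp < 0x80)
        return 1;               /* ASCII: one byte, unchanged */
    if (cp < 0x800)
        return 2;               /* includes Cyrillic, Greek, Hebrew */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* UTF-16 surrogates are not characters */
    if (cp < 0x10000)
        return 3;               /* includes most Han characters */
    if (cp <= 0x10FFFF)
        return 4;
    return 0;
}
```

So a Cyrillic letter such as U+0416 costs two bytes where a single-byte
Cyrillic charset would use one, and a Han character such as U+6F22
costs three.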

For example, suppose you have a dialup system with customers logging
into your system from all over the world. Suppose further that the
Russians want to use filenames with Cyrillic characters, the Chinese
want to use the Han characters, the Europeans want to use ISO Latin 1,
etc. Clearly, it's not sufficient to put a charset label in the
superblock. You need to put a character set label in every file, or
perhaps even put some kind of escape sequence processing if you want to
be able to support both, say, Kanji and ISO Latin 1 in the same file.
Worse yet, if you want to display such characters, you now need to tell
the console how to interpret the application-specific escape sequences.
None of the charset labeling folks have defined a universal escape
sequence for changing between charsets; fundamentally, they assume that
all processing on a particular machine will be done in a single
character set. This gets problematic as soon as you observe that many
documents need to support characters in multiple character sets, and it
completely breaks down in client/server applications where people may be
communicating over the network in multiple languages.
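
By contrast, a UTF-8 string can mix scripts with no labels or escape
sequences at all. The following is a rough sketch, assuming the input
is already valid UTF-8; the name utf8_count is hypothetical. It counts
characters simply by skipping continuation bytes (those of the form
10xxxxxx), and it never needs to know which script a character belongs
to.

```c
#include <stddef.h>

/*
 * Count the code points in a NUL-terminated string that is assumed to
 * be valid UTF-8.  Every byte that is not a continuation byte
 * (10xxxxxx) starts a new character.  No validation is attempted.
 */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For instance, the 8-byte string "Z\xc3\xa9\xd0\x96\xe6\xbc\xa2" holds
one ASCII letter, one Latin 1 letter (U+00E9), one Cyrillic letter
(U+0416) and one Han character (U+6F22), and the function counts four
characters.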

There is also the backwards compatibility issue: how do you handle
existing ext2 filesystems that are currently using ASCII, and the large
body of existing code which assumes that the '/' and '\0' characters
have special meaning? For that reason, the only thing which makes sense
for the ext2 filesystem is to declare that filenames and volume labels
are in UTF-8.
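
The reason UTF-8 is compatible here is that every byte of a multi-byte
sequence has its high bit set, so the bytes for '/' (0x2F) and '\0'
(0x00) can only ever mean themselves. Existing byte-oriented code keeps
working unchanged. As a small sketch (last_component is a made-up
helper, not a kernel or ext2 function), ordinary C string routines
split a UTF-8 pathname correctly:

```c
#include <string.h>

/*
 * Return a pointer to the final component of a pathname.  This is
 * plain byte-oriented code that knows nothing about UTF-8; it is still
 * correct for UTF-8 input because the byte 0x2F never occurs inside a
 * multi-byte sequence.
 */
const char *last_component(const char *path)
{
    const char *slash = strrchr(path, '/');

    return slash ? slash + 1 : path;
}
```

Given the path "/home/\xd0\xbc\xd0\xb8\xd1\x80/\xe4\xb8\xad.txt", whose
directory name is Cyrillic and whose file name starts with a Han
character, the helper returns the bytes "\xe4\xb8\xad.txt", and
strlen() still stops at the real terminating NUL.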

- Ted
