Re: unicode

Theodore Y. Ts'o (tytso@MIT.EDU)
Tue, 19 May 1998 00:45:59 -0400

Date: Sat, 16 May 1998 13:04:00 +0200 (MET DST)
From: (Guest section DW)

Just for your entertainment, have you read
POSIX (ISO/IEC 9945-1: 1996) B.2.3.4 (5)?
(Don't be afraid, it is not a prescription, it is just a
discussion about common usage, where it is remarked that
many Unix systems use filenames in several character sets,
sometimes even a single filename uses several character sets.)

Yes, and in B.2.2.2, lines 1024--1030, it states that use of character
sets beyond "the portable character set or ISO/IEC 646" is "common", but
"technically noncompliant".

[Americans tend to underestimate the enormous cost in time
and money of a conversion. Every American would consider
a proposal to convert all filenames to EBCDIC ridiculous,
just impossible, but now that ASCII and UTF-8 happen to
coincide and Americans can convert for free, they talk
easily about the horrors they plan to inflict on the rest
of the world. Fortunately, for the time being, these plans
look like empty words.]

I'm certainly willing to allocate a bit in the directory entry to help
deal with the conversion issues for folks who have been using the
POSIX.1 non-compliant approach of just storing high-eight-bit characters
in their ext2 filesystems, so that we can distinguish entries
where folks used the non-compliant-but-expedient approach of just using
their local character set from directory entries using UTF-8 to encode
ISO/IEC 10646 characters.
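
A minimal sketch of how such a per-entry bit could be used, assuming a
hypothetical flag value NAME_IS_UTF8 and a site-wide legacy charset
(KOI8-R here, purely as an example). Neither the flag nor the helper
names exist in ext2; they are made up for illustration only.

```python
# Hypothetical flag bit; ext2 defines no such flag. Illustration only.
NAME_IS_UTF8 = 0x01


def decode_name(raw: bytes, flags: int, legacy_charset: str = "koi8-r") -> str:
    """Decode a directory-entry name according to its (hypothetical) flag."""
    if flags & NAME_IS_UTF8:
        return raw.decode("utf-8")
    # Unflagged entry: the bytes are in whatever local charset the site used.
    return raw.decode(legacy_charset)


def upgrade_entry(raw: bytes, flags: int, legacy_charset: str = "koi8-r"):
    """Return (new_raw, new_flags) with the name re-encoded as UTF-8."""
    if flags & NAME_IS_UTF8:
        return raw, flags  # already converted, nothing to do
    name = raw.decode(legacy_charset)
    return name.encode("utf-8"), flags | NAME_IS_UTF8
```

The point of the bit is that old and new entries can coexist during the
transition: unflagged names keep being read in the local charset, and
each entry can be converted on its own schedule.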

As far as I know, people who are doing this today aren't labeling their
filesystems; they are just using some local character set. They are
certainly not storing multiple character sets in a single filename,
because there's no way to distinguish which character set to use, and
any such labelling scheme is certainly non-standard.

So it's fair to provide a compatibility/upgrade path for folks who did
things the old, bad way, so that they can move to the new way which
really *does* allow a filename to contain characters from multiple
character sets (by using UTF-8 encoded ISO/IEC 10646 characters). But to
try to invent another mechanism for doing character set switches inside
filenames is simply madness. (Such schemes usually restrict you to a
small number of character sets anyway.)

Yes, it's hard. But internationalization in general is hard. Hacks
which make a particular Linux system "Russian-only" or
"German-only" are easier (which is basically what "just use your local
charset" is really all about), but that's not really an international
version of Linux. That's just a version of Linux which is specific to a
particular country, and is no better than an English-centric Linux OS.

- Ted
