Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberi=

Matthias Urlichs (smurf@lap.noris.de)
27 Aug 1997 04:13:52 +0200


Alex Belits <abelits@phobos.illtel.denver.co.us> writes:
> On 26 Aug 1997, Kai Henningsen wrote:
>
> > Sort order is important. But cultural sort order (as opposed to any odd
> > sort order) _cannot_ be done via naked byte order and picking the right
> > character set. It's not even possible for English - you want to sort
> >
> > Andy
> > boring
> > John
> >
> > and no naked byte order will ever give you this.
>
> You don't have a clue.
>
??? Of course he has a clue. You _cannot_ sort text in cultural order via
any byte order. At the very least you have to map upper to lower case. You
can only do that right in ASCII and some stupid national variants (they're
stupid because you can't put more than one language into one document --
try mixing the German idiocy of usurping ASCII []{}\| for umlaut characters
with C source code -- oh, so you want to use trigraphs ???).

There's also the question of what you want to achieve. Should capital
letters be wholly distinct? Ignored? Used for some sort of secondary
ordering? Same with umlauts, diacritics, and what-have-you (which can also
be expanded into longer forms when sorting, as with ä -> ae, or maybe even
(c) -> "copyright").

Face it, the only way to do this right is via some generic mechanism, and
as soon as you have that mechanism it's irrelevant whether the character
set you use manages to place A B C in the right order or not.
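
To make the "generic mechanism" point concrete, here is a minimal sketch in
Python. The names EXPANSIONS and sort_key are made up for illustration, and
the expansion table is a tiny assumed subset, not any standard collation
table. The primary key folds case and expands a few characters; the
original string serves as a secondary tie-breaker.

```python
# Hypothetical helper: EXPANSIONS is an illustrative subset, not a standard table.
EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss", "©": "copyright"}

def sort_key(word):
    # Primary key: lower-cased text with selected characters expanded.
    folded = "".join(EXPANSIONS.get(ch, ch) for ch in word.lower())
    # Secondary key: the original string, so "Andy" and "andy" still
    # sort in a fixed order relative to each other.
    return (folded, word)

names = ["boring", "John", "Andy"]
by_code_point = sorted(names)             # ['Andy', 'John', 'boring']
by_key = sorted(names, key=sort_key)      # ['Andy', 'boring', 'John']
print(by_code_point)
print(by_key)
```

Once the key function exists, the order of A B C in the underlying
character set no longer matters; only the key does.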

> Lie. Windows has unicode support that is mainly broken and unused -- this
> is why it has "localized" versions (that will be absolutely pointless if
> it was really internationalized like Unicode's use claims to make
> possible).
>
Great. But is that a fault of Microsoft or of Unicode??

I'd suspect the former...

> > That's why you want to standardize those on UTF-8. You _don't_ want to
> > have the FS have different names in different character sets.
>
> Why do you know what others want? You don't even speak their languages.
>
There are two alternatives here which would actually work. You can display
names from non-local character sets via some sort of machine-readable
transliteration, maybe UTF-7 so that you can actually type the thing, or
you can display them in their native form and depend on the user to figure
out for themselves where the Greek Alpha (or whatever) is.
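
As a rough illustration of the first alternative, Python's built-in
"utf-7" codec turns a non-ASCII name into something typeable on a plain
ASCII keyboard and back again. The filename is an invented example.

```python
# Sketch: a machine-readable, typeable form of a non-local filename.
name = "Αθήνα.txt"                 # invented example filename

typed = name.encode("utf-7")       # bytes containing only ASCII characters
print(typed.decode("ascii"))       # what the user would see and type

restored = typed.decode("utf-7")   # the mapping is reversible
print(restored == name)
```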

A third way would be some sort of human-readable transliteration, but
you'll have to be careful with aliasing -- what if two different names get
transliterated to the same string?
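
The aliasing problem is easy to demonstrate. This is only a sketch: the
TRANSLIT table and both function names are assumptions made up for the
example, not part of any real transliteration scheme.

```python
from collections import defaultdict

# Illustrative, lossy transliteration table (assumed, not a standard).
TRANSLIT = {"ä": "ae", "ö": "oe", "ü": "ue"}

def transliterate(name):
    return "".join(TRANSLIT.get(ch, ch) for ch in name)

def find_aliases(names):
    # Group the original names by their transliterated form and keep
    # only the groups where two or more distinct names collide.
    groups = defaultdict(list)
    for name in names:
        groups[transliterate(name)].append(name)
    return {k: v for k, v in groups.items() if len(v) > 1}

aliases = find_aliases(["Müller.txt", "Mueller.txt", "notes.txt"])
print(aliases)   # {'Mueller.txt': ['Müller.txt', 'Mueller.txt']}
```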

There are also alternatives which won't work. Inserting the disk from
my Greek friend into my Russian friend's disk drive and having the
filenames show up as some jumble of incomprehensible Cyrillic letters is
Not An Option. It gets worse with multibyte characters -- "sorry, but this
character doesn't exist in Klingonese, so you can't type it, thus you can't
open this file" (insert appropriate Klingon insult, then wipe the screen
clean please ;-). (Yes, I do know that you don't need multibyte characters
for Klingon; this is an example.)
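
The Greek-disk-in-a-Russian-drive case can be reproduced in a few lines of
Python. The choice of ISO-8859-7 and ISO-8859-5 as the two local character
sets is an assumption made for the example; any pair of 8-bit sets would
show the same effect.

```python
# Sketch: the same bytes read under two different 8-bit character sets.
greek_name = "Αθήνα"                       # invented example name

raw = greek_name.encode("iso8859_7")       # as stored by the Greek system
misread = raw.decode("iso8859_5")          # as shown by the Russian system
print(misread)                             # Cyrillic letters, not Greek

# With one agreed encoding on disk, both systems recover the same name.
shared = greek_name.encode("utf-8").decode("utf-8")
print(shared == greek_name)
```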

Marking one disk as "On this disk, all names are Cyrillic" and another as
"Greek" and another as "Big5" and another as ... doesn't make sense
either. What should a multilingual translator do, keep one hard disk per
language??

You can pick nits with Unicode all you like, but please, if you want to
replace it, offer us some alternative which actually can be made to work
for everybody and which isn't just another 80% (or even 99%) non-solution.

Note that Unicode does have its flaws. I'm not too happy about the fact
that they unified umlauted and diaeresised characters, but you've got to
draw the line _somewhere_, or else they'll give us different character
codes for the distinct lower-case-g and -a glyphs next. :-( Which BTW
just shows that non-Latin-1 languages don't have a monopoly on that kind of
difficulty.

-- 
Matthias Urlichs