Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Kai Henningsen (kai@khms.westfalen.de)
26 Aug 1997 03:32:00 +0200


abelits@phobos.illtel.denver.co.us (Alex Belits) wrote on 21.08.97 in <Pine.LNX.3.95.970821153948.16080C-100000@phobos.illtel.denver.co.us>:

> On 21 Aug 1997, H. Peter Anvin wrote:

> > Actually, UTF-8 is infinitely expandable, it is just not defined (yet)
> > above 31 bits.
>
> So is BASE64, uuencode and dump in octal. That doesn't make them more
> acceptable.

Neither turns out to be the case. (That goes for both HPA's claim and yours.)

UTF-8 is easily expandable to 2^36, which is a lot more than what we might
need in the foreseeable future, even if we happen to make contact with
several million alien species using as many characters as we do.
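
One way to arrive at the 2^36 figure, as a rough sketch: assume the
lead-byte pattern of the UTF-8 scheme as currently defined (up to 6 bytes,
31 bits) is simply continued by one more byte, with 0xFE as the lead byte.
The function name payload_bits is made up for this illustration and is not
taken from any standard.

```python
def payload_bits(length: int) -> int:
    """Payload bits of a `length`-byte sequence, assuming the original
    UTF-8 lead-byte pattern is continued up to 7 bytes (an assumption)."""
    if length == 1:
        return 7  # 0xxxxxxx
    if not 2 <= length <= 7:
        raise ValueError("this sketch only covers 1 to 7 bytes")
    # Lead byte: `length` one-bits, one zero bit, then (7 - length) payload bits.
    # Each of the (length - 1) continuation bytes (10xxxxxx) carries 6 bits.
    return (7 - length) + 6 * (length - 1)


capacities = {n: payload_bits(n) for n in range(1, 8)}
for n, bits in capacities.items():
    print(f"{n} byte(s): {bits} bits")
```

Under that assumption the 6-byte form gives the 31 bits mentioned above,
and the 7-byte form gives 36.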

None of these is infinitely expandable. Not that it matters. They already
allow ridiculous numbers.

Except, that is, that base64, uuencode, or octal don't specify any
character set definitions (they're just ways to represent any odd binary
data), and UTF-8 does.
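
To illustrate the difference with a small sketch (the byte values are
arbitrary and chosen only for this example): base64 will carry any byte
string at all and says nothing about what the bytes mean, while a UTF-8
decoder only accepts byte sequences that encode characters.

```python
import base64

# Arbitrary binary data; 0xFF can never appear in valid UTF-8.
blob = bytes([0xFF, 0xFE, 0x00, 0x41])

# base64 is a transport encoding: any bytes go in, the same bytes come out.
encoded = base64.b64encode(blob)
decoded = base64.b64decode(encoded)

# UTF-8 is tied to a character set: the same bytes are rejected as text.
try:
    blob.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print(decoded == blob, is_valid_utf8)
```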

> There is nothing nationalistic in distinguishing between similarly-looking
> characters that belong to different languages, have different meaning
> and usage and may be written/typesetted differently. No one proposed to
> make cyrillic "А" and Roman "A" the same character, even though when I
> write this, both of them look exactly the same -- why others should do
> that?

The reason is very simple: the round trip principle. When Unicode was
designed, the rule was that for every ISO standard existing at that time
(and some vendor standards, too),
1. every character in that standard should go into Unicode;
2. it should be possible, without losing any information, to translate
   text in that standard into Unicode and back again;
3. otherwise, characters should be unified.
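
The rules above can be tried out with a small sketch. It assumes ISO 8859-5
as the source standard, which contains both the Latin "A" (0x41) and the
Cyrillic "А" (0xB0), and uses Python's standard "iso8859_5" codec. Because
both characters exist side by side in that one standard, rule 2 forces them
onto separate Unicode code points; otherwise the trip back would lose
information.

```python
# Latin "A" and Cyrillic "А" as two distinct bytes of ISO 8859-5.
original = bytes([0x41, 0xB0])

# Translate into Unicode and back again.
text = original.decode("iso8859_5")
round_tripped = text.encode("iso8859_5")

# The two look-alike letters land on different code points.
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)
print(round_tripped == original)
```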

Anyway, the Han Unification was done by the East Asians themselves.

As to typesetting differences, that's not what Unicode is about. As to
languages, yes, there _is_ unification of characters from different
languages even in the Latin part of Unicode.

MfG Kai