Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Darin Johnson (darin@connectnet.com)
Wed, 27 Aug 1997 14:03:46 -0700 (PDT)


> From: hpa@transmeta.com (H. Peter Anvin)
>
> > In the naive approach of using private-use planes, some problem can be
> > solved, yes, each person can use his/her own character set(s).
> > However, speaking of information interchange, we have to send
> > information about the character set itself along with text.
> > Then, it seems for me that it's multiple character sets system in fact.

> I guess I don't understand what you are talking about here. All I'm
> saying is that if you have a character set which is not supported by
> ISO 10646, there is plenty of space in UCS-4 to map it.

OK, the problem is that each person or group ends up needing their
own private extensions to the charset, which makes information
interchange difficult: you have to ensure that everyone you send your
document to understands your private-use plane! Now if you
standardize within a country, then you've essentially created a new
charset with Unicode as the base, but the whole point of Unicode was
to avoid multiple character sets in the first place.

That's essentially what he's saying - private character sets are
nearly useless for communication, and if you use them anyway you end
up with multiple character sets all over again.

> Basically
> you're using the codepoint, say, U+000F0000, to mean "character 0 in
> myspiffycharacterset-1". Then you are carrying along the information
> of which character set it comes from.

"Myspiffycharacterset" is the problem. The government of a major
economic power should not be required to invent a "spiffy" character
set just to send internal memos. You shouldn't need to invent a
spiffy character set to address a letter to someone in Tokyo or
Beijing. You should use the private space to support Klingon, not
Chinese.
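
To make the interchange problem concrete, here is a rough sketch in
Python. Everything in it is made up for illustration: the two "site"
tables, the glyph names, and the helper functions are my assumptions,
not anything from a standard. The only real facts are that plane 15
private use runs from U+F0000 to U+FFFFD and that the mapping follows
the "character 0 lives at U+000F0000" idea quoted above.

```python
# Hypothetical sketch: the tables and helper names are invented for
# illustration and do not come from any standard or real charset.
PRIVATE_BASE = 0xF0000   # first code point of plane 15 (private use)
PRIVATE_LAST = 0xFFFFD   # last private-use code point in plane 15


def to_private(index: int) -> str:
    """Map character N of a private charset to code point PRIVATE_BASE + N."""
    if not 0 <= index <= PRIVATE_LAST - PRIVATE_BASE:
        raise ValueError("index does not fit in the plane-15 private-use range")
    return chr(PRIVATE_BASE + index)


# Two sites that each invented their own meaning for the same code points.
SITE_A = {0: "glyph 'tree' (site A)", 1: "glyph 'river' (site A)"}
SITE_B = {0: "glyph 'sword' (site B)", 1: "glyph 'shield' (site B)"}


def describe(text: str, table: dict) -> list:
    """Interpret private-use code points with the reader's own table."""
    out = []
    for ch in text:
        cp = ord(ch)
        if PRIVATE_BASE <= cp <= PRIVATE_LAST:
            out.append(table.get(cp - PRIVATE_BASE, "unknown"))
        else:
            out.append(ch)
    return out


message = to_private(0) + to_private(1)
wire_bytes = message.encode("utf-8")       # identical bytes for everyone
as_sent = describe(message, SITE_A)        # what the sender meant
as_received = describe(message, SITE_B)    # what the receiver sees
```

The bytes on the wire are perfectly valid UTF-8 either way, yet
as_sent and as_received disagree unless both ends were handed the
same table out of band. That table is the "information about the
character set itself" that has to travel along with the text.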

So - back to the *kernel*: I would think that even supporters of
Unicode can see that, at this point, it is still a very contentious
standard, controversial enough that there is a high possibility of the
standard changing a lot in the future, or of it being ignored by lots
of people (and an unused standard isn't really a standard anymore).
Thus it's too controversial to standardize on inside Linux. Leave it
to user-space libraries and Linux distributions. Later, if Unicode
does become widely accepted, then think about adding it to the kernel.

(And for heaven's sake, if someone does add it to the kernel, make it
a compile-time option!!!)