Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Michael Poole (poole+@andrew.cmu.edu)
Tue, 26 Aug 1997 12:58:44 -0400 (EDT)


As a preface: Alex, your rants aren't going to convert many
people to your side; few people on either side (for or against Unicode
support... where? in the kernel? is that what this is about?) are
arguing in depth with facts, but the volume of your vitriol exceeds
everyone else's.

The main point I'd like to make, though, is this: this is the
linux-kernel mailing list, and we should restrict ourselves to
discussions pertinent to the kernel. A general debate about the
advantages of Unicode or of some other character encoding doesn't need
to take place here; the only reason to discuss character set encodings
on linux-kernel is to decide what the kernel should use. For
simplicity's sake, the kernel encoding should have certain features
(which I discuss below) that native encodings generally do not provide.

On Mon, 25 Aug 1997, Alex Belits wrote:

> On 26 Aug 1997, Kai Henningsen wrote:
>
> > > Typesetting rules are derived from language, and language information is
> > > present in native encoding + metadata, but lost in Unicode.
> >
> > Surprise! Use Unicode+metadata and keep the information. Or use native
> > encoding without metadata and lose it, too. So what's new?
>
> Unicode is supposed to _replace_ metadata and be still complete. This is a
> lie. Native encodings require metadata to be used together, and that
> doesn't make them any inferior to Unicode.

Perhaps you haven't read The Unicode Standard, Version 2.0 (the big
black book). It says *specifically* that you need data outside the
Unicode text to handle all issues properly; sorting and character
generation (for example, going from Korean jamo to full syllable
glyphs, or from Roman-character input to Han glyphs) are two examples.
While I don't want to get into a debate on the relative merits of
Unicode versus other encodings, it seems reasonable for the kernel to
support more than 256 characters. In the interests of a small kernel
(both source and binary), whatever encoding is used should be
space-efficient and should not require large processing routines (e.g.
for detecting where characters begin and end). Unicode and UTF-8
provide this while still avoiding excessive overhead for storing
unusual characters; most 'native' encodings that support more than 256
characters do not.
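
To make the space/complexity claim concrete, here is a sketch of
mine (not code from any kernel) of the UTF-8 encoding rules; I stop at
the four-byte forms, which cover the whole Unicode range, though UTF-8
as specified extends to six bytes for full UCS-4. Note that ASCII
costs one byte and nothing needs a table lookup:

/*
 * Encode the code point c as UTF-8 into buf (at least 4 bytes).
 * Returns the number of bytes written, or 0 if c is out of range.
 * A sketch only; no validity checks beyond the range tests.
 */
static int utf8_encode(unsigned long c, unsigned char *buf)
{
	if (c < 0x80) {			/* ASCII: 1 byte, no overhead */
		buf[0] = c;
		return 1;
	} else if (c < 0x800) {		/* Latin, Greek, Cyrillic...: 2 bytes */
		buf[0] = 0xC0 | (c >> 6);
		buf[1] = 0x80 | (c & 0x3F);
		return 2;
	} else if (c < 0x10000) {	/* Han, Hangul, kana...: 3 bytes */
		buf[0] = 0xE0 | (c >> 12);
		buf[1] = 0x80 | ((c >> 6) & 0x3F);
		buf[2] = 0x80 | (c & 0x3F);
		return 3;
	} else if (c < 0x110000) {	/* supplementary planes: 4 bytes */
		buf[0] = 0xF0 | (c >> 18);
		buf[1] = 0x80 | ((c >> 12) & 0x3F);
		buf[2] = 0x80 | ((c >> 6) & 0x3F);
		buf[3] = 0x80 | (c & 0x3F);
		return 4;
	}
	return 0;			/* not a valid code point */
}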

> > > Software doesn't exist because it's impossible to write anything based on
> > > Unicode without losing quality below the level, already provided by
> >
> > You misspelt "gaining quality".
>
> Please learn to read in plain English. Quality is lost, and I know it. How
> you can know that or the opposite with your native language in iso8859-1,
> that Unicode is designed to be absolutely compatible with, I have
> absolutely no idea.
>
> > > It simplifies issues for GUI-writers and creates a nightmare for everyone
> > > else. Of course, Microsoft doesn't care about anything but GUI, but I do.
> >
> > Of course, _nobody_ has presented the slightest shred of evidence that it
> > creates nightmares for anybody,
>
> ...If that "anybody" speaks English and German.

Earlier, you posted a question asking what authority someone had
to make an assertion which is almost the opposite of what I quote
above. You don't, however, explain what authority you have to complain
about lost quality.

For the problems with supporting wide-character encodings in user
space, I'll refer any interested parties to the debate/flame war in
which Alex Belits, I, and several other people participated earlier
this year on comp.os.linux.development.system; but this is the
*linux-kernel* list, and user-space issues generally aren't relevant
here.

In the kernel, I think the decision that needs to be made rests on
these points:
* How space-efficient is the encoding when storing text?
- For the present, this 'text' is going to be almost entirely ASCII,
since AFAIK the kernel doesn't involve itself with text inside files.
- In the future, this may change for some users.
* How small is the source and binary code for the operations on text
that the kernel needs to perform?
- For the most part, the kernel just needs to handle iteration over
characters (in both directions? -- see the sketch after this list)
- Console output is an issue -- currently it supports only 256 or 512
characters on PCs, but if something like GGI becomes prevalent, most
users will expect it to cover the full character range. This means
that to display Han characters you'll need a large bitmap table
containing their glyphs, but that table can be arranged to be
swapped out.
- Input is another issue, but I don't feel qualified to comment on it;
I have no idea how it's currently handled or how foreign-language
input methods generally work (or should work).
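
On the iteration point: here is a sketch (mine, and assuming valid
UTF-8 with the pointer not already at a string boundary) of why
stepping over UTF-8 needs no state tables -- every continuation byte
looks like 10xxxxxx, so a character boundary can be recognized from
the byte alone:

/* Continuation bytes in UTF-8 are all of the form 10xxxxxx. */
#define UTF8_IS_CONT(b)	(((b) & 0xC0) == 0x80)

/* Step forward to the start of the next character. */
static const unsigned char *utf8_next(const unsigned char *p)
{
	do {
		p++;
	} while (UTF8_IS_CONT(*p));
	return p;
}

/* Step backward to the start of the previous character. */
static const unsigned char *utf8_prev(const unsigned char *p)
{
	do {
		p--;
	} while (UTF8_IS_CONT(*p));
	return p;
}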

> > and there is some evidence that it
> > actually eliminates such nightmares (such as supporting two dozen
> > different, incompatible character sets, maybe with an abomination like ISO
> > 2022 as "solution", and not covering half as much territory).
>
> Clay tablets support more.. Let's switch to clay tablets.

Gee, that'll make it kind of hard to store characters on disk.
"I've got a 1,000,000-tablet SIMM here, how much do you want for it?"

I've yet to see you argue for what encoding(s) should be used, or
even for what features they should have, but you seem convinced that
"native" (status quo) encodings are better than anything new. Here are
my arguments for why we need something like Unicode or UTF-8 support
in the kernel, as a list of the features required:
* Unambiguous encodings of distinct characters within a language
* Relatively easy to find where characters begin and end (without
loads of state), since it's bad to store fractional characters, e.g.
in a filename (see the sketch after this list)
* A single encoding should be used for all character sets -- you wouldn't
want to have to make guesses about the character set something is in,
and thus possibly misdisplay or mishandle the text.
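
As an illustration of the fractional-character point, here is a
hypothetical helper of mine (utf8_truncate is not a proposed kernel
interface, and it assumes valid UTF-8 input) that clips a string to a
byte budget without splitting a character, as you'd want when fitting
a name into a fixed-size directory entry:

#include <stddef.h>
#include <string.h>

/* Return the largest length <= max at which s can be cut without
 * leaving a fractional character behind. */
static size_t utf8_truncate(const char *s, size_t max)
{
	size_t len = strlen(s);

	if (len <= max)
		return len;
	/* Back up over continuation bytes (10xxxxxx) so the cut
	 * falls on a character boundary. */
	while (max > 0 && ((unsigned char)s[max] & 0xC0) == 0x80)
		max--;
	return max;
}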

Native encodings don't provide these features -- Latin-1 and
Latin-2 conflict on some characters, and Big5 and JIS conflict with
each other and with Latin-1 and Latin-2 on more. I've heard that
finding character boundaries in certain Far East encodings is
impossible unless you start from a known beginning of the string, but
I don't know the details of that.
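
To make the Latin-1/Latin-2 conflict concrete, here are a few byte
values whose readings differ between the two published ISO 8859
tables; the little program (an illustration of mine, nothing more)
just prints both readings of each byte:

#include <stdio.h>

/* The same byte decodes to different characters depending on which
 * ISO 8859 table you assume; without metadata the byte is ambiguous.
 * Rows taken from the published 8859-1 and 8859-2 tables. */
static const struct {
	unsigned char byte;
	unsigned short latin1;	/* Unicode value under ISO 8859-1 */
	unsigned short latin2;	/* Unicode value under ISO 8859-2 */
} conflicts[] = {
	{ 0xA1, 0x00A1, 0x0104 }, /* INVERTED EXCLAMATION vs A WITH OGONEK */
	{ 0xA3, 0x00A3, 0x0141 }, /* POUND SIGN vs L WITH STROKE */
	{ 0xB1, 0x00B1, 0x0105 }, /* PLUS-MINUS SIGN vs a with ogonek */
};

int main(void)
{
	int i, n = sizeof conflicts / sizeof conflicts[0];

	for (i = 0; i < n; i++)
		printf("byte 0x%02X: U+%04X (Latin-1) or U+%04X (Latin-2)\n",
		       conflicts[i].byte, conflicts[i].latin1,
		       conflicts[i].latin2);
	return 0;
}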

> > Plus, there's no reason why GUI writers should profit more than anybody
> > else.
>
> You really don't know the difference between buttons-drawing and text
> processing in databases?

I'll point out two things:
* both of these are problems in user space, not in the kernel
* you provide no support for your argument