Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Darin Johnson (darin@connectnet.com)
Tue, 26 Aug 1997 18:37:45 -0700 (PDT)


> From: Michael Poole <poole+@andrew.cmu.edu>

> In the kernel, I think the decision that needs to be made rests on
> these points:

You forgot an important question:
- Does the kernel even need to concern itself with character encodings?

As far as I can see, all that is needed is ASCII, and even then only
because that's what the various messages use.

The kernel doesn't need to know what charset the filenames are in; it
just needs to leave them alone. And this is a GOOD thing. If one
Linux distribution decides to use Unicode, it can. If another
distribution wants to use SJIS, it can. If another distribution
comes up with a way of handling multiple charsets via escape
sequences, so much the better. All camps should be happy here.
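The pass-through point can be illustrated from user space. This is a
sketch in Python (the throwaway directory and the sample word are my
own choices, not anything from the thread): it stores the same word
under two different encodings in one directory, and on Linux both
names coexist because the filesystem layer never interprets the bytes.

```python
# Sketch: Linux filenames are opaque byte strings, so names in
# different encodings can coexist in one directory.
import os
import tempfile

d = tempfile.mkdtemp()

utf8_name = "grüße".encode("utf-8")      # UTF-8 bytes
latin1_name = "grüße".encode("latin-1")  # same text, Latin-1 bytes

for name in (utf8_name, latin1_name):
    with open(os.path.join(d.encode(), name), "wb") as f:
        f.write(b"hello\n")

# The kernel hands back exactly the bytes it was given; it never
# guessed or converted an encoding.
entries = sorted(os.listdir(d.encode()))
```

Listing the directory with a bytes argument returns the raw names,
so `entries` holds two distinct byte strings for the "same" word.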

If it is suddenly declared that "EXT2 uses UTF-8", how is the kernel
going to accomplish this? It surely won't translate anything coming
from user space, because it doesn't know what encoding those
characters are in. No, such a proclamation would result in zero
code changes. The kernel should just accept what it is handed,
because without user-space help it can't know enough to convert
to or from any official encoding anyway.

Of course, there are some encodings that will give the kernel
problems; e.g., a "/" could be a valid second byte of a multibyte
character (no decent encoding has this problem with NUL, though).
But that's still an issue for user space to handle, fixing up the
problem before the kernel sees it.
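A concrete way to see this: UTF-8 was designed so that every
continuation byte lies in 0x80-0xBF, so "/" (0x2F) and NUL can never
appear inside a multibyte character, while Shift-JIS allows ASCII
"\" (0x5C) as a trail byte (the katakana "ソ" is the classic case).
A small Python sketch of both facts:

```python
# Sketch: UTF-8 continuation bytes all look like 10xxxxxx
# (0x80-0xBF), so the kernel's two special bytes -- '/' (0x2F) and
# NUL (0x00) -- can never hide inside a multibyte character.
def utf8_inner_bytes(text: str):
    """Yield the continuation bytes of every character's encoding."""
    for ch in text:
        encoded = ch.encode("utf-8")
        yield from encoded[1:]          # bytes after the lead byte

sample = "日本語 grüße ελληνικά"
inner = list(utf8_inner_bytes(sample))
assert all(0x80 <= b <= 0xBF for b in inner)
assert 0x2F not in inner and 0x00 not in inner

# By contrast, Shift-JIS puts trail bytes down in the ASCII range:
# katakana "ソ" encodes as 0x83 0x5C, whose second byte is '\'.
assert "ソ".encode("shift_jis") == b"\x83\x5c"
```

This is exactly the kind of fix-up that can (and should) live in
user space: a naive byte scan for 0x2F is safe on UTF-8 names but
not on Shift-JIS names.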

I suspect people may want multilingual console support, and if so,
this issue should be discussed again in that context. (If you
can do it, great; I don't know of any systems that can, including
Windows.) The current thread, though, seems to concern only file names...

On the other hand, even though none of this needs to be in the
kernel, and can all be done in user space, one other question remains:
what *can* be put into the kernel that would be useful and appropriate?

> - Input is another issue, but I don't feel qualified to comment on it;
> I don't have any idea how it's currently handled or how
> foreign-language input methods generally (or "should") work.

These can be entirely user space. Output might not be, because the
kernel does do output. But there is no direct input to the kernel.
(Well, there are LILO command lines, but I doubt anyone is going to
put input methods into LILO. :-)

> Here are my
> arguments on why we need something like Unicode or UTF-8 support in the
> kernel, as a list of the features required:
> * Unambiguous encodings of distinct characters within a language

Unambiguous encodings of a *subset* of distinct characters within a language :-)
(Unicode has 20K different "Han" characters, which leaves a lot of
dictionaries out in the cold.)

But you don't say why this is needed in the *kernel*.

> * Relatively easy to find the begin and end of characters (not loads of
> state), since it's bad to store fractional characters eg in a filename

True. But for most native encodings, this is also true (especially if
all you look for are "/" or "\0").
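For UTF-8 in particular, finding a boundary needs no state at all: a
continuation byte always matches the bit pattern 10xxxxxx, so a scan
of at most a few bytes finds the start of any character. As an
illustration (the function and the sample name are my own, not from
the thread), here is how one might truncate a name without storing a
fractional character:

```python
# Sketch: cut a UTF-8 byte string on a character boundary.
# A lead byte is any byte b with (b & 0xC0) != 0x80, and sequences
# are at most 4 bytes long, so backing up is cheap and stateless.
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Return at most `limit` bytes of data, whole characters only."""
    if len(data) <= limit:
        return data
    cut = limit
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1                    # back up over continuation bytes
    return data[:cut]

name = "naïve日本".encode("utf-8")
for limit in range(len(name) + 1):
    piece = truncate_utf8(name, limit)
    piece.decode("utf-8")           # never raises: no fractional chars
```

Many native multibyte encodings (EUC, Shift-JIS) need more context
than this to find a boundary, which is the "loads of state" being
alluded to above.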

> * A single encoding should be used for all character sets -- you wouldn't
> want to have to make guesses about the character set something is in,
> and thus possibly misdisplay or mishandle the text.

Well, Unicode doesn't really solve this problem either. You *can*
misdisplay with unicode. That web page mentioned yesterday was
enlightening.

But again, the kernel doesn't have to make a guess; it just accepts
data from user space and regurgitates it back again.

> From: "Svein Erik Brostigen" <SveinErik.Brostigen@ksr.okpost.telemax.no>

> I, for one, would love to be able to have both Japanese, Korean, Thai and
> Norwegian characters on the screen at the same time and without any tricky
> stuff to make this possible.

Ironically, I can do this in MULE, which isn't unicode, but I can NOT
do this in Windows NT, which is unicode.

> From: "Richard B. Johnson" <root@analogic.com>

> Some primitive Operating Systems provide for 'Code Pages' which are
> Language-specific chunks of code that interface into stdin, stdout, and
> stderr. This is 'user mode' stuff.

But it's also what people *use*. I was amazed when our product added
code page support, since I thought it was archaic, but adding things
the customer can use sells more licenses.

Also, this stuff may theoretically be user space, but it gets
into kernel code quite often. Note that the Linux HPFS code
uses code pages. (Come to think of it, has any OS used code
pages that wasn't from IBM or partially controlled by IBM?)

The drawback of code pages is that they are very European-oriented,
and you can't easily mix and match code pages within a single
document. The advantage is that they're simple and easy, and
millions of systems use them.
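The mix-and-match problem is easy to demonstrate: a code page is just
a 256-entry byte-to-character table, so the very same byte means
different things under different pages. A quick Python illustration
(using the standard codec names for three common pages):

```python
# Sketch: one byte, three code pages, three different characters.
# This is why code pages can't be mixed within a single document:
# the bytes carry no record of which table they were written under.
b = b"\xe4"
assert b.decode("cp437") == "Σ"     # IBM PC (original DOS page)
assert b.decode("cp1252") == "ä"    # Windows Western European
assert b.decode("cp1251") == "д"    # Windows Cyrillic
```

A document containing byte 0xE4 displays correctly only on a system
configured for the page it was written under, which is precisely the
guessing game the single-encoding argument above wants to avoid.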