Re: unicode (char as abstract data type)

Albert D. Cahalan (acahalan@cs.uml.edu)
Fri, 17 Apr 1998 23:27:01 -0400 (EDT)


Alex Belits writes:
> On Fri, 17 Apr 1998, Albert D. Cahalan wrote:

>> It is not dead. The Unicode support in the system allows for a
>> future world without 8-bit apps. The transition may take a decade.
>> When the transition is done, there won't be so much reencoding
>> between apps and the kernel.
>
> In a decade Unicode most likely will be in the same place
> where EBCDIC is now.

That would be KOI-8, used only by Alex Belits.

Look at it this way:

We are stuck in a world with multiple character encodings.
To convert between any two of them, you generally need to go through UCS2.
The kernel must convert for foreign filesystem support.
The library & apps must convert for many other reasons.
If libc can use UCS2 to call the kernel, then the kernel
only needs to perform half of the conversion and libc won't
need to convert back to UCS2. Put more of it in user-space!
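
A rough sketch of the pivot idea, in Python rather than kernel C, may help. The function names are made up for illustration. Python has no codec called "ucs-2", so the sketch assumes "utf-16-le", which is byte-identical for characters in the Basic Multilingual Plane. With N encodings, a pivot needs about 2N converters instead of one per ordered pair.

```python
# Assumption: "utf-16-le" stands in for UCS2 (same bytes for BMP characters).
UCS2 = "utf-16-le"


def to_ucs2(raw: bytes, source_encoding: str) -> bytes:
    """First half of a conversion: some 8-bit encoding -> UCS2."""
    return raw.decode(source_encoding).encode(UCS2)


def from_ucs2(ucs2: bytes, target_encoding: str) -> bytes:
    """Second half of a conversion: UCS2 -> some 8-bit encoding."""
    return ucs2.decode(UCS2).encode(target_encoding)


def convert(raw: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Full conversion between two 8-bit encodings, pivoting through UCS2."""
    return from_ucs2(to_ucs2(raw, source_encoding), target_encoding)


if __name__ == "__main__":
    koi8_name = "Привет".encode("koi8_r")
    print(convert(koi8_name, "koi8_r", "cp1251"))
```

If the kernel interface carried UCS2, the kernel would only ever run the to_ucs2 half and user space would run the from_ucs2 half.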

Think of a machine with several users and several filesystems.
Maybe they are all Czech, which Martin Mares reports as having
more than 5 character encodings. Each user wants to see the system
in their preferred encoding. Solution: the kernel reads filenames
from disk in whatever format is there, then converts to UCS2.
The library converts UCS2 into the format which each user wants.
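
Here is a small model of that split, as a sketch only. The mount table, the function names kernel_readdir and libc_readdir, and the sample filename are all hypothetical, and "utf-16-le" is assumed as a stand-in for UCS2.

```python
# Assumption: "utf-16-le" stands in for UCS2 (same bytes for BMP characters).
UCS2 = "utf-16-le"

NAME = "žluťoučký kůň.txt"

# Hypothetical mounts: mount point -> (on-disk encoding, raw names on disk).
MOUNTS = {
    "/mnt/dos": ("cp852", [NAME.encode("cp852")]),
    "/mnt/win": ("cp1250", [NAME.encode("cp1250")]),
    "/home": ("iso8859_2", [NAME.encode("iso8859_2")]),
}


def kernel_readdir(mount: str) -> list:
    """Kernel side: read names in the on-disk format, hand back UCS2."""
    disk_encoding, raw_names = MOUNTS[mount]
    return [raw.decode(disk_encoding).encode(UCS2) for raw in raw_names]


def libc_readdir(mount: str, user_encoding: str) -> list:
    """Library side: turn UCS2 into whatever this user prefers."""
    return [
        name.decode(UCS2).encode(user_encoding, errors="replace")
        for name in kernel_readdir(mount)
    ]


if __name__ == "__main__":
    for mount in MOUNTS:
        for user_encoding in ("iso8859_2", "cp1250", "cp852"):
            print(mount, user_encoding, libc_readdir(mount, user_encoding))
```

Every user sees the same name in their own encoding, whatever is on disk, and the kernel never needs to know which encoding a given user picked.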

The yucky alternative: the conversion from UCS2 to _one_ local
encoding is also in the kernel and users that don't like the chosen
encoding are screwed: live with it or suffer a _second_ conversion.

>>>> I certainly don't want to see 8-bit kernel calls on Merced.
>>>
>>> Then you won't see vi there either.
>>
>> Oh? The last time I heard, vi accessed system calls via libc.
>
> But in what encoding will it represent that to the terminal?

KOI-8 if you prefer. Except for virtual consoles, this is not
a kernel issue at all.

It is an ncurses issue if you want your consoles in UTF-8 mode.
Note that you could have ncurses support dumb 8-bit apps on
UTF-8 consoles, and it doesn't have to be ASCII or Latin-1.
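
To make that concrete, here is a sketch of such a translation layer. It is not how ncurses works internally; the class ConsoleShim and its method names are invented for illustration. The one real subtlety is that a UTF-8 keystroke can arrive split across reads, so the input side uses an incremental decoder.

```python
import codecs


class ConsoleShim:
    """Sits between a dumb 8-bit app and a console running in UTF-8 mode."""

    def __init__(self, app_encoding: str):
        self.app_encoding = app_encoding
        # Buffers incomplete UTF-8 sequences between calls.
        self._keys = codecs.getincrementaldecoder("utf-8")()

    def to_console(self, app_output: bytes) -> bytes:
        """App wrote 8-bit text; emit the UTF-8 the console expects."""
        return app_output.decode(self.app_encoding).encode("utf-8")

    def from_console(self, key_bytes: bytes) -> bytes:
        """Console sent UTF-8 (maybe a partial sequence); give the app 8-bit."""
        text = self._keys.decode(key_bytes)
        return text.encode(self.app_encoding, errors="replace")


if __name__ == "__main__":
    shim = ConsoleShim("koi8_r")
    print(shim.to_console("файл".encode("koi8_r")))
    first, second = "я".encode("utf-8")[:1], "я".encode("utf-8")[1:]
    print(shim.from_console(first), shim.from_console(second))
```

The app encoding is a constructor argument, so the same shim works for KOI-8, Latin-2, or anything else with a codec.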

>> That is the applications. This is the kernel mailing list.
>> We have a library called "libc" that provides an interface
>> between the applications and the kernel. Applications can
>> still see filenames in KOI-8 if you so desire. (you won't
>> care what libc does to non-KOI-8 filenames because you won't
>> have any such names on your disk)
>
> How will it know that it's koi8

Option 1: compile that knowledge into libc
Option 2: use an environment variable that libc interprets
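
Both options fit in a few lines. This is only a sketch: the variable name FILENAME_CHARSET is hypothetical, not something any libc reads, and "utf-16-le" is assumed as a stand-in for UCS2.

```python
import os

# Assumption: "utf-16-le" stands in for UCS2 (same bytes for BMP characters).
UCS2 = "utf-16-le"

# Option 1: knowledge compiled into the library.
DEFAULT_ENCODING = "koi8_r"


def user_encoding(environ=os.environ) -> str:
    """Option 2: a (hypothetical) environment variable overrides the default."""
    return environ.get("FILENAME_CHARSET", DEFAULT_ENCODING)


def name_for_user(ucs2_name: bytes, environ=os.environ) -> bytes:
    """Convert a UCS2 filename from the kernel into the user's encoding.

    Characters the target encoding cannot represent become "?".
    """
    return ucs2_name.decode(UCS2).encode(user_encoding(environ), errors="replace")


if __name__ == "__main__":
    name = "файл".encode(UCS2)
    print(name_for_user(name, {}))
    print(name_for_user(name, {"FILENAME_CHARSET": "cp1251"}))
```

Neither option needs a charset label stored with the filename.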

> if charset labeling will be eliminated (and this is the whole
> point of Unicode -- to avoid need of charset labeling by
> providing some flat space)?

No & no.

You don't use charset labeling on your filenames, do you?
Somehow you are able to interpret them anyway.

Unicode can "limp by" without language information just like Latin-1
can limp by without it. Full use of Latin-1 needs language information
for sorting and selection of a desirable font. No difference here!

>> Think about the consequences of UTF-8 at the system call level:
>> Every system call that uses text must be first converted to UTF-8.
>> This burden is with us forever. Meanwhile, Windows and MacOS can
>> avoid conversion costs after the world converts to UCS2.
>>
>> The world _will_ convert too. As much as you may hate it, you
>> must realize that when Sun, Microsoft, and Apple agree...
>> It is only a matter of time -- perhaps a decade.
>
> They say it, but they don't _do_ it -- and they can't do that anyway.

At the kernel level, it's already done. This is of course the
Linux KERNEL mailing list, so only the KERNEL part matters here.

It is not often that those 3 companies agree on anything.
