Re: unicode (char as abstract data type)

Alex Belits (abelits@phobos.illtel.denver.co.us)
Tue, 21 Apr 1998 09:25:28 -0700 (PDT)


On 21 Apr 1998, Matthias Urlichs wrote:

> Alex Belits <abelits@phobos.illtel.denver.co.us> writes:
> >
> > > > I can _not_ ignore it if it's there. As some discussion in IETF FTP-WG
> > > > demonstrated, in some cases (such as FTP directory) the only way to handle
> > > > unknown charset at the other end of the wire is to assume something about
> > >
> > > Since when did we put FTP service into the kernel?
> >
> > FTP uses kernel system calls, last time I checked. And it doesn't
>
> Nope, FTP uses libc calls. Give it a "site encoding KOI8-R" call and it'll
> tell the libc to transliterate these pesky file names, and bingo you see
> what you always saw.

Really? How does it know? A site has no global encoding; users may each
have their own. Since FTP has no labeling mechanism and deals with the
local encoding (labeled or not), encoding handling should be completely
outside of the protocol. Pipes and sockets have no translation or labeling
by themselves, and things work just fine.
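
To illustrate the point about pipes, here is a minimal sketch in Python
(the language and the file name are assumptions made for illustration).
It only shows that bytes written to a pipe come out unchanged, whatever
encoding they happen to be in:

```python
import os

# Hypothetical file name, encoded in KOI8-R by whoever created it.
name_bytes = "файл.txt".encode("koi8_r")

def through_pipe(data: bytes) -> bytes:
    """Send bytes through an OS pipe and return what comes out."""
    read_fd, write_fd = os.pipe()
    try:
        os.write(write_fd, data)
    finally:
        os.close(write_fd)
    try:
        return os.read(read_fd, len(data) + 1)
    finally:
        os.close(read_fd)

# The pipe neither labels nor translates: the same bytes arrive.
received = through_pipe(name_bytes)
```

The reader still has to know (or guess) that the bytes are KOI8-R, but
nothing was lost or altered on the way.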

> > transfer charset information from remote end. And has no means for
> > that. However NFS (quite kernel-related thing) doesn't know anything about
> > libc,
>
> NFS is a kernel-to-kernel interface. The libc on the client machine is
> expected to do the transliteration, if any.

Then if anything that transfers files which can also be accessed over
NFS does translation, I will have a lossy round trip, so it is better to
leave things transparent.
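
A rough sketch of such a lossy round trip, in Python. The file name, the
two charsets and the `translate` helper are all assumptions chosen for
illustration; the letter "ђ" exists in CP1251 but not in KOI8-R:

```python
# Hypothetical file name created by a user working in CP1251.
original = "ђак.txt".encode("cp1251")

def translate(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from src to dst; unmappable characters become '?'."""
    return data.decode(src).encode(dst, errors="replace")

# A translating service stores the name in the server's KOI8-R...
stored = translate(original, "cp1251", "koi8_r")
# ...and translates it back when the name is fetched again.
round_trip = translate(stored, "koi8_r", "cp1251")

# A transparent service just stores and returns the same bytes.
transparent = bytes(original)

print(round_trip == original)   # False
print(transparent == original)  # True
```

Once the first letter has been replaced by "?", no later step can
recover it, while the transparent path keeps the name intact.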

> > What if I use multiple charsets and don't want kernel to meddle? Or I
> > use database?
> >
> You can't use multiple charsets ... ahem, encodings please, on one
> filesystem transparently.

...as long as they are not in the file and filenames simultaneously --
unless you propose to make a distinction between text files and non-text
ones, text fields with attributes in binary files,... Oops, I think I
remember some system that tried to do that.

I still don't see any good reason for the kernel to do anything
non-transparently, and libc, since it deals with charsets anyway, has no
need to use Unicode. Unicode can be useful for the rare operation of
conversion between different charsets of the same language (say, a mail
gateway between systems that traditionally use different charsets and
have failed to agree on a common one), but this is an application issue,
not a kernel or libc one.
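
As a sketch of that gateway case, assuming Python and a made-up subject
line, a conversion between two Russian charsets can use Unicode purely
as the intermediate form inside the application. The `convert` helper is
hypothetical:

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Convert between two charsets, with Unicode only as the pivot."""
    return data.decode(src).encode(dst)

# Hypothetical mail subject arriving in KOI8-R.
koi8_subject = "Привет".encode("koi8_r")

# The gateway rewrites it for a system that uses CP1251.
cp1251_subject = convert(koi8_subject, "koi8_r", "cp1251")

# Both charsets cover these letters, so converting back is lossless.
restored = convert(cp1251_subject, "cp1251", "koi8_r")
```

Neither the kernel nor the stored data has to be in Unicode for this to
work; only the converting application touches it.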

> Databases and their contents are a userspace problem; libc has the
> transliteration routines should you need them.

What do you mean by "transliteration"? Unicode conversion? Lossy
translation between local charsets (the definition of "transliteration"
that I know)? A database is the least likely to be in Unicode, and the
most likely to need language-dependent processing -- say, phonetic match.
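
To show why transliteration in the lossy sense differs from charset
conversion, here is a small Python sketch. The table is tiny and made up
for illustration; real transliteration schemes are larger and vary:

```python
# Made-up Cyrillic-to-Latin table covering only a few letters.
TABLE = str.maketrans({"т": "t", "с": "s", "ц": "ts", "а": "a", "о": "o"})

def transliterate(text: str) -> str:
    """Replace each known Cyrillic letter with a Latin approximation."""
    return text.translate(TABLE)

# Two different spellings collapse into the same Latin string,
# so the original cannot be reconstructed from the result.
print(transliterate("ц") == transliterate("тс"))  # True
```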

--
Alex
