Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Kai Henningsen (kai@khms.westfalen.de)
26 Aug 1997 02:30:00 +0200


aem@netcom.ca (Andrew E. Mileski) wrote on 21.08.97 in <199708211310.JAA00467@netcom.ca>:

> > > How about this: All filesystems give textual metadata (filenames, that
> > > is) in UTF-8 (so that most things will mostly work, before they are
> > > re-written to deal with UTF-8 explicitly).
> >
> > Why UTF-8?
>
> Of course, we could make our own encoding, say Linux-8, where bit 7 set
> means there are 7 more bits in the next byte to look at. A pair of codes
> could be reserved as a "terminator" and "separator", and used universally.
> This scheme is infinitely expandable, and not limited like UTF-8. <Laugh>
> Should the day come, we could support a multitude of alien dialects :-)

Like with UTF-8, you mean? :-)
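
For comparison, here is a minimal sketch of the "bit 7 means more follows"
idea from the quoted text. The names linux8_encode and linux8_decode are
made up for illustration, and putting the most significant 7-bit group
first is an assumption; the quoted proposal does not specify an order.

```python
def linux8_encode(value):
    """Encode a non-negative integer as 7-bit groups, most significant first.

    Hypothetical scheme: every byte except the last has bit 7 set.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def linux8_decode(data):
    """Decode exactly one value produced by linux8_encode."""
    value = 0
    for index, byte in enumerate(data):
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:          # bit 7 clear: this was the last byte
            if index != len(data) - 1:
                raise ValueError("extra bytes after the final byte")
            return value
    raise ValueError("truncated sequence")
```

One difference from UTF-8 shows up in this sketch: the last byte of a
multi-byte sequence has bit 7 clear, so it can collide with ASCII. For
example, linux8_encode(0xAF) ends in 0x2F, which is '/'. UTF-8 never uses
bytes below 0x80 inside a multi-byte character.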

ISO 10646 has 2^31 possible characters. UTF-16 has 2^16+2^20 possible
characters. Current plans seem to use about 2^18 of these characters for
the foreseeable future.

UTF-8 could easily support 2^36 different characters. That's 2^18 types of
aliens that need the same number of characters as we do. That's quite a
lot.
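
The arithmetic can be checked with a short sketch. One way to arrive at
2^36 (an assumption about the reasoning, not something stated above):
UTF-8 as specified stops at six-byte sequences, which carry 31 bits; a
seven-byte form with lead byte 0xFE would carry 36. The function name
utf8_payload_bits is made up for illustration.

```python
def utf8_payload_bits(length):
    """Payload bits in a UTF-8-style sequence of `length` bytes (1..7)."""
    if length == 1:
        return 7                    # 0xxxxxxx
    if not 2 <= length <= 7:
        raise ValueError("this sketch only covers 1..7 byte sequences")
    lead_bits = 7 - length          # 110xxxxx has 5, 1110xxxx has 4, ...
    return lead_bits + 6 * (length - 1)   # each 10xxxxxx byte adds 6


bits_by_length = {n: utf8_payload_bits(n) for n in range(1, 8)}

# UTF-16: the 2^16 base positions plus 2^20 reached through two-word pairs.
utf16_positions = 2**16 + 2**20

# 2^36 code positions split into blocks of 2^18 characters each.
alien_charsets = 2**36 // 2**18
```

Here bits_by_length[6] is 31 (all of ISO 10646) and bits_by_length[7] is
36; alien_charsets comes out as 2^18.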

Oh, perhaps a very short summary of what all these are, since some people
seem to have trouble keeping them apart (and I hope I don't confuse things
myself).

ISO 10646. ISO standard about these things; a 31 bit code, several
encodings.

Unicode. A standard by the Unicode consortium; a 16 bit code.

(UCS = Universal Character Set)
UCS-4. 32 bit encoding of ISO 10646 (I believe in network byte order).
UCS-2. The lower 2^16 positions of UCS-4 (the "base page"), in a 16 bit
       encoding. Unicode 1.x.

(UTF = UCS Transformation Format)
UTF-16. A 16 bit encoding, with some two-word encoded characters, for
        representing the first 2^16+2^20 positions of UCS-4. Unicode 2.0.
UTF-8.  An 8 bit encoding for representing all of UCS-4 with varying-length
        multibyte characters. ASCII remains ASCII. Originally invented for
        filename storage.
UTF-7.  A 7 bit encoding for representing the same positions as UTF-16
        (it encodes UTF-16 words); most of ASCII remains, but not all;
        anything else is basically in base64 encoding. Primarily for
        channels that aren't 8 bit clean.
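
To make the UTF-8 and UTF-16 entries concrete, here is a minimal sketch
of both encodings. The function names are made up for illustration. The
UTF-8 part follows the original one-to-six byte layout, so it covers the
full 31 bit range rather than only what Python's own encoder accepts, and
neither function checks for the values reserved for two-word pairs.

```python
def utf8_encode(cp):
    """Encode one code position with the 1..6 byte UTF-8 layout."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("outside the 31 bit range")
    if cp < 0x80:
        return bytes([cp])                  # ASCII remains ASCII
    # (sequence length, first code position that no longer fits)
    limits = [(2, 0x800), (3, 0x10000), (4, 0x200000),
              (5, 0x4000000), (6, 0x80000000)]
    for length, limit in limits:
        if cp < limit:
            break
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))     # 10xxxxxx continuation byte
        cp >>= 6
    lead_mark = (0xFF << (8 - length)) & 0xFF   # 0xC0, 0xE0, 0xF0, ...
    return bytes([lead_mark | cp] + tail[::-1])


def utf16_units(cp):
    """Return the 16 bit words for one code position (one or two words)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("not reachable with UTF-16")
    if cp < 0x10000:
        return [cp]                         # base page: one word
    v = cp - 0x10000                        # 20 bits remain
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
```

For example, utf8_encode(0x41) is the single byte b"A", while
utf16_units(0x10FFFF) gives the last two-word pair, which is where the
2^16+2^20 limit comes from.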

Current usage (IIRC, and _very_ rough):

0x00000000-0x0000ffff: base page. The "normal" stuff. Originally every
                       character that was in any ISO charset, or in one of
                       the more popular vendor charsets.
0x00010000-0x0002ffff: standard extensions (mostly planned; stuff like
                       exotic character sets (like Egyptian
                       hieroglyphics), imaginary character sets (like
                       Klingon), and the famous ever-expanding Chinese
                       name signs)
0x00030000-0x0010ffff: free/private use, accessible with UTF-16
0x00110000-0x7fffffff: free/private use, not accessible with UTF-16

The web pages should cover this in more detail.

> I personally don't think the kernel should need to know more than one
> encoding. Any translation functionality can go in the user space.

Especially since ANSI C already defines such translation functions
(mbtowc(), wctomb(), mbstowcs(), wcstombs()). I'm not sure if glibc
implements these; in any case, they're related to locales, which would
seem to be the right place.

MfG Kai