Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

H. Peter Anvin (hpa@transmeta.com)
26 Aug 1997 06:23:41 GMT


Followup to: <6caP1HYjcsB@khms.westfalen.de>
By author: kai@khms.westfalen.de (Kai Henningsen)
In newsgroup: linux.dev.kernel
>
> That turns out not to be the case. (Actually, both HPA's and yours.)
>
> UTF-8 is easily expandable to 2^36, which is a lot more than what we might
> need in the foreseeable future, even if we happen to make contact with
> several million alien species using as many characters as we do.
>
> None of these is infinitely expandable. Not that it matters. They already
> allow ridiculous numbers.
>
> Except, that is, that base64, uuencode, or octal don't specify any
> character set definitions (they're just ways to represent any odd binary
> data), and UTF-8 does.
>

Actually, UTF-8 is open-ended: it is currently defined only up to
2^31, but depending on how you extend it, it could be expanded
indefinitely.

We already have:

0xxxxxxx for up to 7 bits
110xxxxx 10xxxxxx for up to 11 bits
1110xxxx (2 * 10xxxxxx) for up to 16 bits
11110xxx (3 * 10xxxxxx) for up to 21 bits
111110xx (4 * 10xxxxxx) for up to 26 bits
1111110x (5 * 10xxxxxx) for up to 31 bits

... we can then define ...

11111110 (6 * 10xxxxxx) for up to 36 bits
11111111 100xxxxx (6 * 10xxxxxx) for up to 41 bits
11111111 1010xxxx (7 * 10xxxxxx) for up to 46 bits
11111111 10110xxx (8 * 10xxxxxx) for up to 51 bits
11111111 101110xx (9 * 10xxxxxx) for up to 56 bits
11111111 1011110x (10 * 10xxxxxx) for up to 61 bits
11111111 10111110 (11 * 10xxxxxx) for up to 66 bits
11111111 10111111 100xxxxx (11 * 10xxxxxx) for up to 71 bits

... etc ...
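
The rows above can be generated mechanically. The following Python
sketch is an illustration only, not part of any standard: the names
payload_bits and template are hypothetical, and it assumes the pattern
visible in the table (an n-byte sequence carries n ones, a zero, and
5n+1 payload bits, with every byte after the first starting with 10).

```python
def payload_bits(n: int) -> int:
    """Payload capacity, in bits, of an n-byte sequence (n >= 2)."""
    return 5 * n + 1


def template(n: int) -> str:
    """Return the byte layout of an n-byte sequence, e.g. '110xxxxx 10xxxxxx'."""
    if n < 2:
        raise ValueError("the one-byte form 0xxxxxxx is the ASCII special case")
    # Logical bit stream: unary length (n ones), a terminating zero, the payload.
    stream = "1" * n + "0" + "x" * payload_bits(n)
    # The first byte takes 8 bits of the stream; every later byte takes 6 bits
    # and is marked as a continuation byte with a 10 prefix.
    groups = [stream[:8]]
    groups += ["10" + stream[i:i + 6] for i in range(8, len(stream), 6)]
    return " ".join(groups)


if __name__ == "__main__":
    for n in range(2, 15):
        print(f"{n:2d} bytes, {payload_bits(n):2d} bits: {template(n)}")
```

For example, template(8) yields 11111111 100xxxxx followed by six
10xxxxxx bytes, which is the 41-bit row.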

All of this is a straightforward extension of UTF-8: basically, the
length of the bit string is encoded in unary, followed by the bit
string itself. Except for the anomaly at the beginning (to make it
ASCII compatible), the bit length is always rounded up to the nearest
n*5+1. The sequence starts with n leading ones (n is also the total
number of bytes), followed by a zero, followed by the bit string. Any
byte beginning with 10 is a continuation byte; the unary prefix and
the bit string both continue immediately after the 10 prefix.
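
A minimal Python sketch of this rule, assuming the shortest possible
sequence is always chosen and that values are plain non-negative
integers. The names encode and decode are hypothetical; nothing here
is a defined standard beyond the 31-bit forms, for which the output
should coincide with ordinary UTF-8.

```python
def encode(value: int) -> bytes:
    """Encode a non-negative integer using the open-ended scheme."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 0x80:
        return bytes([value])  # ASCII-compatible special case: 0xxxxxxx
    # Smallest n with 5*n + 1 >= bit length, i.e. ceil((bits - 1) / 5).
    n = (value.bit_length() + 3) // 5
    # Logical bit stream: n ones, a zero, then the value in 5n+1 bits.
    stream = "1" * n + "0" + format(value, f"0{5 * n + 1}b")
    out = [int(stream[:8], 2)]
    # Every later byte carries 6 stream bits behind a 10 prefix.
    out += [0x80 | int(stream[i:i + 6], 2) for i in range(8, len(stream), 6)]
    return bytes(out)


def decode(data: bytes) -> int:
    """Decode exactly one sequence produced by encode()."""
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first < 0x80:
        if len(data) != 1:
            raise ValueError("trailing bytes after a one-byte sequence")
        return first
    if first < 0xC0:
        raise ValueError("sequence starts with a continuation byte")
    # Rebuild the logical bit stream by stripping the 10 prefixes.
    stream = format(first, "08b")
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte")
        stream += format(byte & 0x3F, "06b")
    n = stream.index("0")  # unary length; raises ValueError if no zero found
    if n != len(data):
        raise ValueError("length prefix does not match the number of bytes")
    return int(stream[n + 1:], 2)
```

With these definitions, encode(2**36 - 1) still fits the 7-byte form
starting with 11111110, while encode(2**36) needs the 8-byte form
starting with 11111111 100xxxxx.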

-hpa

-- 
    PGP: 2047/2A960705 BA 03 D3 2C 14 A8 A8 BD  1E DF FE 69 EE 35 BD 74
    See http://www.zytor.com/~hpa/ for web page and full PGP public key
Always looking for a few good BOsFH.  **  Linux - the OS of global cooperation
        I am Baha'i -- ask me about it or see http://www.bahai.org/