Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Teunis Peters (teunis@usa.net)
Mon, 25 Aug 1997 10:45:00 -0600 (MDT)


On Wed, 20 Aug 1997, Peter Holzer wrote:

> Alex Belits wrote:
> >On Wed, 20 Aug 1997, Erik Corry wrote:
> >
> >> Unicode is regularly extended, and is incredibly complete in
> >
> >...by a committee.
>
> Yes, of course. By who else?

Addenda contributed by countries (then reviewed by committee).

It really wouldn't be so bad if committees actually worked... This one's
not TOO bad as they go. (Remember Ada, anyone?)

> >And they don't release a free implementation of
> >it or updates to existing ones after that.
>
> This is a problem. And last time I heard the standard was only available
> on paper, which is not the best format for something which consists
> almost completely of tables.

www.unicode.org, yes? Text files full of tables. The visual description
is AFAIK only available in paper form, though. (I've never seen the paper
form - I can't afford it.)

> >> What
> >> more could you want?
> >
> >Japanese and Chinese characters encoding that Japanese and Chinese people
> >use, perhaps?
>
> Unicode does include Japanese and Chinese characters. Some may be
> missing, of course, but they can (and should) be added.

If only Unicode supported continuously growing languages....
(Big5 does AFAIK but that's about it)

> >> Linux has already standardised on UTF-8 for the console.
> >
> >(looking at the console...) No, still looks like koi8-r for me... Having
> >the internal support doesn't mean that it's usable enough to make it
> >mandatory everywhere.
>
> Same here. At least on 2.0.30 (I haven't any 2.1.x kernel running at the
> moment) the console is straight Latin-1, not UTF-8: pressing
> the "ö" key gives me the single code F6, not C3 B6, and printing C3 B6
> to the console prints "Ã¶", not "ö". The escape sequences in unicode.txt
> don't switch to Unicode, either.
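
[For reference, a minimal sketch of the byte values in question, assuming
Python 3 and its standard codecs. The UTF-8 form of "ö" (U+00F6) is the two
bytes C3 B6, and a Latin-1 console shows each of those bytes as its own
character. The variable names are only illustrative.]

```python
# "ö" is code point U+00F6.
ch = "\u00f6"

# Latin-1 stores it as one byte; UTF-8 needs two.
latin1_bytes = ch.encode("latin-1")
utf8_bytes = ch.encode("utf-8")

# What a Latin-1 console displays when it is fed the UTF-8 bytes:
# each byte is interpreted as a separate Latin-1 character.
misread = utf8_bytes.decode("latin-1")

print(latin1_bytes.hex(), utf8_bytes.hex(), misread)
```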

Funny - they work for me... (but wait a sec - I last used them in 2.1.24,
where they worked. I think there's a bug here! Can anyone back me up on
this one? [Though the console source in 2.1.51 looks like it should work.])

[clean-up and clipping]
>
> >I'm not aware of any development of Unicode-using tools. And unless
> >sh / bash / grep / awk /... work with UTF-8 as they do with native
> >characters (that is, a variable-length-encoded character is treated as one
> >character, which I don't think anyone will do any time soon), no one will
> >use it for anything decent.
>
> The good news about UTF-8 is that most things will "just work". The bad
> news is that quite a lot of programs must be fixed to work properly in
> all cases. For example "grep ä foo" will find exactly the lines with one
> or more "ä" characters in them, even though each is represented by two
> bytes. Similarly "grep (ä|ö|ü) foo" will find the lines with "ä", "ö" or
> "ü" in them, but "grep [äöü] foo" will not work as intended: it will match
> a lot of other characters, too, unless grep (or rather regex in libc) knows about
> UTF-8. Similarly all programs which count characters (wc, less, vi, ...)
> must be adapted to handle multibyte characters. But this is true for all
> character sets with more than 256 characters.
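
[A rough sketch of the effect described above, assuming Python 3 and using
its re module on byte strings as a stand-in for a byte-oriented grep. This
is not how grep or libc regex is implemented; the sample words and variable
names are made up for illustration.]

```python
import re

# Sample lines: "grün" (contains ü), "naïve" (contains ï), "plain" (ASCII only).
lines = ["gr\u00fcn", "na\u00efve", "plain"]
raw = [s.encode("utf-8") for s in lines]

# Alternation: each alternative is a complete two-byte sequence,
# so a matcher that only knows bytes still finds the right lines.
alt = re.compile("(\u00e4|\u00f6|\u00fc)".encode("utf-8"))   # (ä|ö|ü)

# Bracket expression: a byte-wise matcher sees a set of single bytes
# (C3, A4, B6, BC), and C3 is the lead byte of many other characters.
cls = re.compile(b"[" + "\u00e4\u00f6\u00fc".encode("utf-8") + b"]")  # [äöü]

alt_hits = [s for s, b in zip(lines, raw) if alt.search(b)]
cls_hits = [s for s, b in zip(lines, raw) if cls.search(b)]

# Counting has the same problem: bytes and characters disagree.
char_count = len(lines[0])
byte_count = len(raw[0])

print(alt_hits, cls_hits, char_count, byte_count)
```

[Here the alternation matches only "grün", while the byte-wise bracket
expression also matches "naïve", because "ï" shares the lead byte C3.]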

Hmm.... Personally I'd like to see this implemented... but not everyone
wants it, so how about a UTF-8-native Linux distribution, anyone?

[could be fun and a whole pile of changes... :]

Ciao!
- Teunis