Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Teunis Peters (teunis@usa.net)
Tue, 19 Aug 1997 19:14:29 -0600 (MDT)


On Tue, 19 Aug 1997, Alex Belits wrote:

> On Mon, 18 Aug 1997, Teunis Peters wrote:
>
> > Beyond that the Chinese still (AFAIK) haven't decided whether or not to
> > actually USE unicode [the language has other ways of creating new
> > characters - this is not something computers are good at handling],
> > though Unicode has largely been accepted [mostly by fiat].
>
> AFAIK, Chinese, Japanese and Russians _oppose_ Unicode that is mostly
> pushed by people who use iso8859-1 anyway, and thus have trivial mapping
> between their native charset and Unicode.

I don't know what the Japanese or Russian reasons are for opposition but
the Chinese have REALLY good reasons for disliking Unicode....

The chief problem is the lack of any way to handle a growing, changing
character set.

> > Personally I think Unicode is a really good idea... I like the idea of
> > being able to put descriptive filenames in files.
> > sometimes the native language [eg Japanese] is the only way to describe a
> > file.
>
> ...and people use native charsets/encodings for a long time already --
> and then Unicode appeared to "make it possible". Traditionally everything
> network-related was supposed to either be ASCII-only or use MIME charsets
> definitions. It worked fine in Russia and Japan (I have no information
> about China or Korea), but now Unicode supporters are trying to push
> "mandatory" Unicode into HTML. They completely ignore that HTTP is never

FWIW - there are 3 standard encodings in Japanese.... And no way to tell
the difference. For me anyway Unicode-2.0 [ISO 10646 actually] makes life
MUCH easier.... I can't really afford to try and hunt down all the
myriad encodings anyway.
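
To make the "no way to tell the difference" point concrete, here is a small sketch. It assumes the three encodings meant are Shift_JIS, EUC-JP and ISO-2022-JP (the post does not name them), and the sample string is arbitrary. It only shows that the same text becomes three unrelated byte strings, and that guessing the wrong one does not give the text back.

```python
# Illustration only: one string under three Japanese encodings.
# Assumption: the "3 standard encodings" are Shift_JIS, EUC-JP, ISO-2022-JP.
text = "日本語"  # arbitrary sample text

encoded = {
    "shift_jis": text.encode("shift_jis"),
    "euc_jp": text.encode("euc_jp"),
    "iso2022_jp": text.encode("iso2022_jp"),
}

def decode_as(raw: bytes, codec: str) -> str:
    """Decode raw bytes with a codec, substituting U+FFFD for invalid input."""
    return raw.decode(codec, errors="replace")

# Nothing in the bytes themselves says which codec produced them, so a
# reader has to guess. A wrong guess yields garbage, not the original text.
wrong_guess = decode_as(encoded["euc_jp"], "shift_jis")

# With a single encoding (here UTF-8) there is nothing to guess.
utf8 = text.encode("utf-8")
```

With a charset tag (MIME, HTTP header) the right codec can be picked; a bare filename carries no such tag.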

AFAIK Chinese has about 4-5 encodings, not counting countries that
incorporate other character sets as well (Korean, Japanese <somewhat>, and
so on)

I want Unicode in _ALL_ string actions in my GUI. Not just WWW.
Besides, I don't use WWW all that often (and character maps CAN be
afforded there - they are too slow in GUI)

> used without HTTP header (or META tags), and everyone learned how to add
> charset tags there long ago. The same for FTP, even though the only two
> known platforms that "support" Unicode at filesystem level are
> Windows NT and plan9, while others have absolutely no means even to
> provide reliable translation to the "local" charsets because "local" may
> be different for different users on the same box -- the concept completely
> unknown for Windows FTP servers authors who support that in FTP-WG mailing
> list.

So FTP-client on NT box is logging into a 'nt domain'.... Though from what
I've heard about NTFS this is a particularly stupid assumption as
translation tables are stored in the filesystem.

> > Not that it matters but I think filenames from 16bit+
> > filesystems should be encoded into UTF-8 before being passed to the user.
>
> ...thus requiring to distinguish them from "normal" data everywhere and
> breaking every piece of software that should treat data in files as
> filenames (say, "make").

No - this should work fine. It sure beats English-oriented OS's as well
:)

It's just an 8-bit filename... it's not like Makefile/makefile gets
scrambled or anything. [or .c/.C/.c++/.java/... the rest of the filename
can be 8bit for what it matters].

In other words : EXPLAIN!

I was just trying to figure out how 16-bit filesystems are translated in
Linux.... Why not UTF-8 encode them?
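
As a rough sketch of what UTF-8 encoding a 16-bit filename involves: each 16-bit unit becomes one, two or three bytes depending on its value. The function name `ucs2_to_utf8` is made up for illustration, and the sketch assumes plain UCS-2 (no surrogate pairs); it is not how any particular kernel driver does it.

```python
def ucs2_to_utf8(units):
    """Encode a sequence of 16-bit code units (plain UCS-2) as UTF-8 bytes.

    Hypothetical helper for illustration; surrogate pairs are not handled.
    """
    out = bytearray()
    for u in units:
        if not 0 <= u <= 0xFFFF:
            raise ValueError("not a 16-bit code unit: %r" % (u,))
        if u < 0x80:
            # 7-bit ASCII passes through as a single, unchanged byte.
            out.append(u)
        elif u < 0x800:
            # 11 bits -> 110xxxxx 10xxxxxx
            out.append(0xC0 | (u >> 6))
            out.append(0x80 | (u & 0x3F))
        else:
            # 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
            out.append(0xE0 | (u >> 12))
            out.append(0x80 | ((u >> 6) & 0x3F))
            out.append(0x80 | (u & 0x3F))
    return bytes(out)

# Example: a name as it might sit in a 16-bit directory entry.
name_units = [ord(c) for c in "Makefile"]
utf8_name = ucs2_to_utf8(name_units)
```

The result is an ordinary 8-bit string, so it can be handed to anything that already deals with 8-bit filenames.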

> > So what filesystems are dependent on what character set?
> >
> > FAT : 8-bit IBM-PC
> > VFAT : 16-bit Unicode (UCS-2)
> > ext-2 : Latin-1? (though UTF-8 is supported)
>
> 8-bit, not Latin-1. Latin-1 (iso8859-1) is one of charsets used with it.
>
> It's hard to "not support" Unicode -- it's just 8 bits. It's already used
> for local encodings, and there is _NO_SUCH_PROBLEM_ as "foreign languages
> support" in the 8-bit-clean filesystem. The only two problems that exist
> and can be solved by Unicode are:
>
> 1. charset tagging (Unicode is so large, it includes everything, or at
> least, authors think so);

It's got a ways to go - and it can never include 'everything' until it's
capable of storing non-linear and changeable characters (eg Chinese)

I can't afford ISO specs. I have Unicode-2.0 info. [also
EUC/JIS/SJIS/Han/.... but just about no western standards].

The only way I know what Latin-1 encoding characters are is by
reverse-engineering the translation in Sun's Java code.

> 2. compatibility with Windows NT.
>
> For the first one the solution creates more problems than it solves (such
> as breaking everything that treats data as filenames unless everyone will
> switch to 16-bit Unicode or, worse, UTF-8, what is the last thing any sane
> non-English-speaking person will do to his language). Second one is not
> something I care about (non-iso8859-1-speaking Windows NT users aren't
> that fond of Unicode anyway).

Heck - It _STILL_ beats english-only. And for that matter all the
hundreds of character mappings.... which are expensive docs.

One could do worse than Unicode-2.0. IBM-PC character set anyone?

And hey, the trick is in the translation anyway. Unicode [or its 32bit
form, ISO 10646 UCS-4] is a handy way of storing international strings. Who
says the keyboard has to be US-english or the display ISO-8859-1?

The only written character set I've really seen any problems with is
Chinese (the 'common' written form). [apparently Russian has problems but
I don't know why? I thought they HAD a fixed alphabet!]

Sorry to mess lkernel up with this but I have a feeling this whole debate
needs to be figured out... I was just curious how translation from
16bit+ filesystems is handled?

Don't say [cut top 8 bits] or [just leave it].... Both solutions mess
things up. I vote UTF-8 translation.
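
A quick sketch of why the two shortcuts break things. The helper names are invented for illustration. Cutting the top 8 bits makes different characters collide (and can even produce a '/'), while leaving the raw 16-bit data alone puts NUL bytes into every ASCII name.

```python
def cut_top_bits(name: str) -> bytes:
    """The '[cut top 8 bits]' option: keep only the low byte of each unit."""
    return bytes(ord(c) & 0xFF for c in name)

def leave_raw(name: str) -> bytes:
    """The '[just leave it]' option: hand out the 16-bit units as bytes."""
    return name.encode("utf-16-le")

def to_utf8(name: str) -> bytes:
    """The UTF-8 option."""
    return name.encode("utf-8")

# U+042F (Cyrillic capital YA) has low byte 0x2F, which is ASCII '/'.
ya = "\u042f"
truncated_ya = cut_top_bits(ya)      # a path separator out of nowhere

# U+0441 (Cyrillic small ES) and 'A' (U+0041) share the low byte 0x41.
collision = cut_top_bits("\u0441") == cut_top_bits("A")

# Raw 16-bit data contains NUL bytes, which end a C string early.
raw_makefile = leave_raw("Makefile")

# UTF-8 keeps the names distinct and free of '/' and NUL.
utf8_ya = to_utf8(ya)
```

Neither problem shows up with UTF-8, because every byte of a non-ASCII character has its high bit set.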

This is _JUST_ for 16bit+ filesystems. Not 8bit filesystems like ext2 or
the like.

If there is going to be no standard at least one should be able to probe
the filesystem type and apply an appropriate translation (eg: this FS is
vfat ergo std. 16bit Win95-encoded Unicode translated to UTF-8)
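
The "probe the filesystem type and apply a translation" idea could look roughly like the table below. Everything here is an assumption for illustration: the function and table names are invented, vfat long names are taken to be little-endian 16-bit Unicode, old FAT names are taken to be IBM-PC code page 437, and 8-bit filesystems are passed through untouched.

```python
# Hypothetical per-filesystem translation table (illustration only).
# Maps a filesystem type to the codec of its on-disk names, or None
# when the name is already an opaque 8-bit string.
ON_DISK_CODEC = {
    "vfat": "utf-16-le",   # assumed: 16-bit Unicode, little-endian
    "msdos": "cp437",      # assumed: 8-bit IBM-PC character set
    "ext2": None,          # 8-bit clean: leave the bytes alone
}

def name_for_user(fstype: str, raw: bytes) -> bytes:
    """Return the filename bytes that would be passed up to the user."""
    try:
        codec = ON_DISK_CODEC[fstype]
    except KeyError:
        raise ValueError("no translation known for %r" % fstype)
    if codec is None:
        return raw
    return raw.decode(codec).encode("utf-8")

# A vfat long name comes out as plain 8-bit "Makefile".
vfat_name = name_for_user("vfat", "Makefile".encode("utf-16-le"))
# An ext2 name is returned byte for byte, whatever charset it is in.
ext2_name = name_for_user("ext2", b"\xe4\xf6.txt")
```

Note that only the 16-bit entry actually needs translation; whether 8-bit FAT names should be touched at all is a separate question.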

Incidentally, 7-bit ASCII characters are encoded unchanged in UTF-8, so all
those english filenames (Makefile for example) are NOT touched!!!!!
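
This is easy to check. In the sketch below (sample names are arbitrary, helper name invented) an ASCII name is byte-for-byte identical in ASCII and UTF-8, and in a mixed name only the non-ASCII part changes, using bytes that all have the high bit set.

```python
def non_ascii_bytes_have_high_bit(name: str) -> bool:
    """True if every byte produced by a non-ASCII character is >= 0x80."""
    for ch in name:
        if ord(ch) >= 0x80:
            if any(b < 0x80 for b in ch.encode("utf-8")):
                return False
    return True

plain = "Makefile"
same_as_ascii = plain.encode("utf-8") == plain.encode("ascii")

# The extension that make or a compiler looks at survives untouched.
mixed = "日本語.c"
mixed_utf8 = mixed.encode("utf-8")
keeps_extension = mixed_utf8.endswith(b".c")
high_bit_only = non_ascii_bytes_have_high_bit(mixed)
```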

Does that explain what I said earlier?

Have a nice day, eh?
- Teunis