Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:

From: Nicolas Mailhot
Date: Tue Feb 17 2004 - 07:38:14 EST


|Alex Belit wrote:
|
|On Mon, 16 Feb 2004, Marc Lehmann wrote:
|
|> > I have never claimed that the kernel really talks UTF-8, and indeed, I
|> > would say that such a kernel would be terminally and horribly broken.
|>
|> And I'd say such a kernel would be highly useful, as it would standardize
|> the encoding of filenames, just as unix standardizes on "mostly ascii"
|> (i.e. the SuS).
|>
|> However, just as POSIX is a nice but very limited base, (mostly) ASCII
|> is a nice and very limited base. UTF-8 would also be a good base.
|
| UTF-8 is dependent on Unicode, that is cumbersome [...] Enforcing UTF-8
| will burn the bridges to any other language support infrastructure or
| encoding, right at the time when such infrastructure is likely to be created.

Quite the contrary. The current UTF-8 migration shows that the major showstopper
when changing filename encodings is that right now you don't know what damned
encoding to convert from.

With a clear policy (for *current* encodings) one can change.

Without one you're reduced to expensive guesswork (i.e. *humans* have to spend
*days* checking that the conversion worked as expected).
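
To make the guesswork concrete, here is a minimal Python sketch. The helper
candidate_decodings and the sample bytes are made up for illustration; the
only point is that one byte string can be valid in several legacy encodings at
once, so a successful decode proves nothing about which encoding was intended.

```python
def candidate_decodings(raw, encodings=("utf-8", "latin-1", "koi8_r")):
    """Return {encoding: text} for every encoding that decodes raw without error."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            # Not valid in this encoding, so it can be ruled out.
            continue
    return results


# Hypothetical filename bytes as they might sit on disk.
name = b"\xe9t\xe9"
for enc, text in candidate_decodings(name).items():
    print(enc, repr(text))
```

These bytes are rejected as UTF-8 but accepted by both Latin-1 and KOI8-R, with
different text in each. Only someone who knows what the name was supposed to say
can pick the right one.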

I happen to hate imperial units. *My* country switched to full metric more than
two hundred years ago. However, I'll take a value in imperial units any day over
some number without an explicit unit.

Implicit unit/encoding is a damn stupid thing to do. There are numerous examples
of big expensive projects that failed because of this kind of misunderstanding.
Many apps and humans need to interpret filenames to perform their job.

(BTW if anyone cares, I was raised next to a computer whose primary purpose was
translating to a non-Latin language. So I know quite a lot of the recipes for
"getting by" and having worthless archives after a few years.)

|> 8-bit bytes as filenames is not a good base, however, since they enforce
|> a different layer of interpretation between the user and the kernel, and
|> this interpretation cannot be based on the locale nor the filesystem
|> itself, as there is no way to find out what encoding the filename is in.
|
| This is a matter of GUI implementation. If someone cared about this, he
|would store language metadata with filename, too, however this is clearly
|contrary to the Unix filesystem design.

If you think filename interpretation is GUI-only stuff you're sadly mistaken.
Filename-based processing is widespread.

|> 8-bit bytes is convenient, but not useful for i18n environments. in the
|> past, it was also convenient and nobody cared, since everything was
|> either 8-bit or double-byte, and nobody exchanged files.
|
| I did, and it worked _fine_. Everyone who is willing to use UTF-8 is
|free to do this right now, and everything will already work great for
|them. Writing software to deliberately enforce UTF-8 is something
|completely different from using UTF-8 for yourself.

|> This, however, is going to change, and the current methodology of "just
|> guess, you might be right" is a hindrance to this.
|
| This was "going to change" for more than a decade already, or,
|alternatively, already happened if you listen to someone like Martin
|Duerst. The reality is, everything can pass UTF-8 already, yet people use
|other encodings for everything, too, and as long as they don't break,
|things work.

Up to a certain point.
Past this point all the heuristics in the world won't help you and people
suddenly revise their definition of "work".

| Breaking byte-value transparency in any place in the system
|is counterproductive

There is nothing transparent in the system for filename users.
Generalised guesswork is not transparency.

[...]

|> However, just as with URLs (which are byte-streams, too), byte-streams are
|> useless to store text. You need bytestreams + known encoding.
|
| MIME has a perfectly usable standard for declaring encodings, and huge
|amounts of text (that may include filenames) are distributed by
|MIME-compliant or MIME-like protocols (mail and HTTP, to name two).

Fine. Just convert all your filenames to garbage and see how great it is that their
contents are still readable because the file formats have encoding info. I'm
pretty sure you'll still miss your nice filenames.

Let me repeat my point:
1. filenames have a meaning
2. the meanings are important
3. they cannot be reliably decoded without encoding info

Therefore encoding info needs to be added, using either FS metadata or a clear standard.
And I don't care if the standard is UTF-8, UCS-foo, Egyptian hieroglyphs or whatever.
I want a f* standard. Every single person who has had to work on the mess that results now
from many users using different incompatible locales on a single FS wants a f* standard.
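
For contrast, here is a small Python sketch of what conversion looks like once
the source encoding is declared, whether by FS metadata or by a standard. The
function name transcode_name and the sample bytes are hypothetical; the declared
encoding is simply passed in, since today there is nowhere to read it from.

```python
def transcode_name(raw, declared, target="utf-8"):
    """Re-encode filename bytes from a declared encoding into the target encoding."""
    # Strict decoding: a name that does not match its declared encoding raises
    # UnicodeDecodeError instead of being silently mangled.
    return raw.decode(declared).encode(target)


# Hypothetical Latin-1 filename converted to UTF-8.
print(transcode_name(b"\xe9t\xe9", "latin-1"))
```

With the encoding known, the conversion is one mechanical step and any mismatch
is reported as an error; nobody has to eyeball the result.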

Someone wrote about it being akin to changing read()/write() to do encoding conversion
on the fly. This is blatantly false - file contents are userspace-level, and an app
isn't expected to read other apps' files. And an app can use formats that declare the file
encoding. But any app *will* need to read the names of files it didn't generate, because
they happen to reside in the same directory. And it *won't* be able to specify the filename
encoding, because the filename format belongs to the kernel. So it's the *kernel*'s job to
provide encoding info somewhere so app authors can interpret names correctly.

"Sorry, we won't do it" is not a valid answer.

App writers have solved what they could - file contents (which are encoding-aware now
thanks to XML and friends). What they cannot solve without kernel help is filename
encoding - because filenames, unlike file contents, are shared, and that requires a
system-level decision.
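
As a rough illustration of the shared-directory problem, the Python sketch below
sorts raw names (for instance, what os.listdir returns when given a bytes path)
into what an app can actually determine on its own. The function classify_names
and the sample listing are invented for the example, and "valid UTF-8" is only
strong evidence, not proof, that a name was written as UTF-8.

```python
def classify_names(names):
    """Sort raw filename bytes into 'ascii', 'utf-8' and 'unknown' buckets."""
    buckets = {"ascii": [], "utf-8": [], "unknown": []}
    for raw in names:
        try:
            raw.decode("ascii")
            buckets["ascii"].append(raw)
            continue
        except UnicodeDecodeError:
            pass
        try:
            raw.decode("utf-8")
            buckets["utf-8"].append(raw)
        except UnicodeDecodeError:
            # Some legacy 8-bit or multi-byte encoding; which one is not recorded.
            buckets["unknown"].append(raw)
    return buckets


# Hypothetical directory shared by users with different locales.
listing = [b"notes.txt", b"\xc3\xa9t\xc3\xa9", b"\xe9t\xe9"]
print(classify_names(listing))
```

Everything that lands in the "unknown" bucket is exactly what an app cannot
interpret without encoding info supplied from outside.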

--
Nicolas Mailhot

Attachment: signature.asc
Description: This is a digitally signed message part