UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)

From: Marc Lehmann
Date: Mon Feb 16 2004 - 13:39:16 EST

[I may be a bit late in response, but AFAICS these points have not yet
been mentioned]

On Sat, Feb 14, 2004 at 06:41:20PM -0800, Linus Torvalds <torvalds@xxxxxxxx> wrote:
[discussion on why UTF-8 is the only sane encoding, which I absolutely
agree with, removed]

> In short: the kernel talks bytestreams, and that implies that if you want
> to talk to the kernel, you HAVE TO USE UTF-8.

This is not the problem at all. It's perfectly easy to write
applications that talk UTF-8 and just UTF-8 with the kernel.

The problem is that the kernel does not use UTF-8, i.e. applications in
the current linux model have to deal with the fact that the kernel
happily breaks the assumed protocol of using UTF-8 by delivering illegal
byte sequences to applications.

There is no way for applications to handle UTF-8 and illegal-utf8 in
a sane way, so most apps will either eat the illegal bytes, skip the
filename, or crash (the latter case is clearly a bug in the app, thr
former cases aren't).

Fixing the VFS to actually enforce what linus claims (2filenames are
utf-8") is a very good idea, imho.

As I understand it, the reason linux currently doesn't, is that this utf-8
rule was obviously non-enforcable in practise in recent years, since
UTF-8 simply wasn't widespread (even today, applications such as bash or
grep are clearly not UTF-8 ready, as they start to crawl in UTF-8 locales
without special patches, and even with special patches).

So the only sane way to implement this enforcement is usign an
additional moutn-flag, e.g. "force-utf8".

An encoding=xyz mount flag OTOH would be total overkill, as the plan
must be to switch to UTF-8 in the long run, while allowing deviating
behaviour in the short run.

Conversely, filesystems such as NTFS, VFAT etc. need to convert from the
fs encoding to UTF-8 and vice versa automatically, at least when this
flag is specified.

It should become the default in some future linux version.

> People understand the problem. And UTF-8 is the solution.

The kernel needs to fully implement it. Just as a kernel accepting:

open ("directory", O_WRONLY); write (dirfd, ...)...
open ("/some/file", ...)
mkdir ("../some/file", ...)

is considered rather broken behaviour from unix kernels (although these
might have been allowed in some dialects or versions of unix) today, this:

mkdir ("</ encoded using illegal multibyte sequence>", ...)

will be considered broken behaviour in the future. The RFC defining UTF-8
clearly considers this a bug in UTF-8 implementations, the the kernel
in fact does NOT implement UTF-8 right now, although some people claim
that the kernel accepting UTF-8 (and more) is correct behaviour, it isn't
according to the RFC.

> It's getting there. I think even Microsoft has seen the light, and is
> phasing out their crapola (UCS-2LE? Whatever).

Microsoft and Java officially use UTF-16 nowadays. The funny thing is
that "next character" iterators in both languages skip to the next word
in UCS-2, so the claim of both parties of UTF-16 support is basically a
marketing lie.

> No. Things like "iocharset" are not the solution. They are literally the
> _problem_. The solution is to use something that not only acts as ASCII,
[full agreement]

> And that one true format is UTF-8. End of story. If you try to talk to the
> kernel in UCS-2 or anything else, you _will_ fail.

Just that the kernel does not support UTF-8. It delivers and accepts
non-UTF-8 strings such as \xc0\x80. The kernel clearly should not deliver
broken characters when the official stanza is that the linux VFS API is
UTF-8 only (see 3.2, Chapater 3, C12, conformance, ony why it currently
isn't UTF-8).

-----==- |
----==-- _ |
---==---(_)__ __ ____ __ Marc Lehmann +--
--==---/ / _ \/ // /\ \/ / pcg@xxxxxxxx |e|
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE --+
The choice of a GNU generation |
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/