Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)

From: Marc Lehmann
Date: Mon Feb 16 2004 - 16:00:07 EST


On Mon, Feb 16, 2004 at 12:23:25PM -0800, Linus Torvalds <torvalds@xxxxxxxx> wrote:
> > And I am saying that this is not good, which is my sole point.
>
> Fair enough.

Thank you (honestly), for acknowledging it.

> However, that's where the unix philosophy comes in. The unix philosophy
> has always been to not try to understand the data that the user passes
> around - and that "everything is a bytestream" is very much encoded in the
> basic principles of how unix should work.

This never really applied to paths or filenames, although I admit that
this was indeed the intention, hindered by the need to have _some_
out-of-band data like '/'.

So yes, that's the principle.

> That agnosticism has a lot of advantages. It literally means that the
> basic operating system doesn't set arbitrary limitations, which means that
> you can do things that you couldn't necessarily otherwise easily do.

I think '/' is arbitrary, but of course there is a difference between
ruling out 2 of 256 byte values and ruling out hundreds of thousands of
byte combinations.

However, UTF-8 was invented by the inventors of unix precisely to
represent unicode _text_ in filenames.

Yes, this does not mean that they intended your typical kernel to
enforce this, but it is still what plan9 does, and it's a very useful
"limitation" because it removes the need for hard-to-convey out-of-band
data describing how filenames are to be interpreted.

Unlike file contents, filenames are always meant to represent text, not
binary data, which I hope you agree with.

> It does mean that you can do "strange" things too, and it does mean that
> user space basically has a lot of choice in how to interpret those byte
> streams.

Windows, for example, allows far stranger things, like "x.y." meaning
the same as "x.y", or even worse tricks involving "illegal UTF-8".

Most, if not all, of these strange exceptions in windows resulted in
exploitable security flaws.
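
(For the record, the classic example of what "illegal UTF-8" buys an
attacker: the overlong byte pair 0xC0 0xAF decodes to '/' in a lenient
decoder, so a check for "../" applied to the raw bytes is bypassed.
A tiny user-space illustration, made up for this mail:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* '/' is normally the single byte 0x2F; the overlong pair
         * 0xC0 0xAF also decodes to U+002F under a lenient decoder.
         * A filter scanning the raw bytes for "../" never triggers,
         * yet the decoded path still climbs a directory. */
        unsigned char name[] = { '.', '.', 0xC0, 0xAF, 'e', 't', 'c', 0 };

        printf("raw bytes contain \"../\"? %s\n",
               strstr((char *)name, "../") ? "yes" : "no");  /* prints "no" */
        return 0;
    }

Strict UTF-8 forbids such overlong forms outright.)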

While this certainly is workable (in the sense that these were all
application bugs), I think it would be highly useful to be able to enforce
this by saying: "you can put whatever you like into filenames, as long as
it is encoded using UTF-8".

Just like the kernel currently enforces "it must be encoded using
octets", which is trivial, of course, but consider that UCS-2, while also
being a bytestream, is NOT a valid encoding for filenames in linux.

So the kernel is certainly not agnostic with respect to encodings.
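
(A trivial illustration of why: in UCS-2, every ASCII character carries a
0x00 byte, and the pathname interface treats 0x00 as end-of-string, so
such a name can never even reach a filesystem:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "AB" in UCS-2 (big-endian): the very first byte is 0x00,
         * so any C-string based interface sees an empty pathname. */
        unsigned char ucs2_name[] = { 0x00, 'A', 0x00, 'B', 0x00, 0x00 };

        printf("length as seen through the syscall interface: %zu\n",
               strlen((char *)ucs2_name));   /* prints 0 */
        return 0;
    }

)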

It is, however, agnostic with respect to multibyte encodings in the
ISO C sense; it does not support wide-character encodings.

This is not bad at all, but I claim that the kernel simply isn't as
agnostic as we all might wish, and if there is a specification on
well-formed encodings, it could just as well be "UTF-8".

(Right now it is "most 8-bit encodings, all multibyte encodings, but no
multiword encodings with embedded nulls or '/'", which is simply too
fuzzy for me to find acceptable. Admittedly, this is a subjective
opinion.)

> And yes, it can cause confusion. You don't like the confusion, so you
> argue that it shouldn't be allowed. It's a valid argument, but it's an
> argument that assumes that choice is bad.

Actually, I called for an optional way of enforcing one very promising
encoding over a lot of other encodings.

The kernel already supports only a subset of possible encodings, and I
said that it would be useful to be able to limit this choice, as an
option, to the one encoding that even _you_ call the only sane one.

> If you want to _force_ everybody to use UTF-8, then yes, the kernel could

I don't want to do that. I still happen to find the rare iso-8859-1
filename on my disk, and almost puke every time because it breaks bash
etc. I even modified my terminal emulator (rxvt) to support the
occasional German umlaut because it's so convenient.

But all in all, it would be very much easier if this simply couldn't
happen, especially because it's so easy to get from "illegal utf-8
allowed" to "security issue in formerly secure code".

> enforce that readdir() would never pass through a broken UTF-8 string, and
> all the path lookup functions also would never accept a broken string. It's
> not technically impossible to do, although it would add a certain amount
> of pain and overhead.

I can certainly yield to arguments like "if you want it, send a patch good
enough to be accepted" :)
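
(If anyone cares what such a check would roughly look like, here is a
minimal user-space sketch, certainly not a patch and deliberately
simplistic; the function name is made up. It rejects overlong forms,
UTF-16 surrogates and code points above U+10FFFF; '\0' and '/' are
already excluded by the existing pathname rules:

    #include <stddef.h>

    /* Return 1 if s[0..len) is well-formed UTF-8, 0 otherwise. */
    static int utf8_valid(const unsigned char *s, size_t len)
    {
        size_t i = 0;

        while (i < len) {
            unsigned char c = s[i];
            unsigned int cp, need, k;

            if (c < 0x80) { i++; continue; }                   /* ASCII */
            else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; need = 1; }
            else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; need = 2; }
            else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; need = 3; }
            else return 0;             /* bad lead or stray continuation byte */

            if (len - i - 1 < need)
                return 0;                              /* truncated sequence */
            for (k = 1; k <= need; k++) {
                if ((s[i + k] & 0xC0) != 0x80)
                    return 0;                          /* not a continuation */
                cp = (cp << 6) | (s[i + k] & 0x3F);
            }
            if ((need == 1 && cp < 0x80) ||
                (need == 2 && cp < 0x800) ||
                (need == 3 && cp < 0x10000))
                return 0;                              /* overlong encoding */
            if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
                return 0;                       /* out of range or surrogate */
            i += need + 1;
        }
        return 1;
    }

)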

> But the thing is, not everyone uses UTF-8.

I don't want to enforce it, exactly because, as you say, it'll
take years. I do think that a way to enforce this as a local (or
distribution-wide) policy would be very helpful indeed.

I don't mind if some embedded system doesn't want the overhead of having a
UTF-8 environment that simply isn't useful for it.

> people from doing clever things. Encoding other information in filenames
> might be proper for a number of applications.

But it would simplify a lot of applications that otherwise have no idea
about the filename encoding (it's a pain in the ass in perl, which has
generally good (unicode) text support, yet filenames, which are text,
simply come out garbled most of the time).

It's rather similar to the discussion about extended attributes (ok,
that's a hornets' nest, or at least was one :). Having a clear API
definition would certainly help.

It's not policy either, as applications wanting to encode e.g. binary
information can certainly do so using UTF-8, just as they currently can
do so by other means, e.g. by escaping \0 and /.
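
(It could look something like this; the scheme and the function name are
made up for the example, and any reversible mapping into valid UTF-8 does
the job just as well:

    #include <stdio.h>

    /* Escape arbitrary binary data into a filename that is plain ASCII
     * (and therefore trivially valid UTF-8): every byte outside the
     * printable ASCII range, plus '%' and '/', becomes %XX.  The mapping
     * is reversible, so the original bytes can be recovered. */
    static void escape_name(const unsigned char *in, size_t len, char *out)
    {
        while (len--) {
            unsigned char c = *in++;

            if (c >= 0x20 && c < 0x7F && c != '%' && c != '/')
                *out++ = (char)c;
            else
                out += sprintf(out, "%%%02X", (unsigned)c);
        }
        *out = '\0';
    }

)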

> It would also be very painful, since it would mean that when you mount an
> old disk, you may be totally unable to read the files, because they have
> filenames that such a kernel would never accept.

That's why I asked for a mount option. My opinions might be too extreme
for your taste, but I am still living on earth. I, too, have these
disks.

However, I also have some windows disks, and umlauts or Japanese
characters on these often result in chaos, which is not really a bug in
linux either, but highly annoying to people trying to use linux.

> You don't "need" a known encoding. The kernel clearly doesn't need one.
> It's a container, and the encoding comes from the outside.

Right now the kernel does impose an encoding on _some_ byte values. I
can't use UCS-2; the kernel simply isn't agnostic.

Please, I don't want to use UCS-2, UTF-16, or other such atrocities :) I
am just trying to make the point that the kernel already imposes some
restrictions on encodings.

> And that's what I mean by agnostic - you can make your own encoding.

Given the limitations of the kernel in interpreting byte streams, I can.
However, only under the existing constraints.

I'd like to have a more forcing constraint - UTF-8. It might never be
implemented (especially not against your will), but I think it isn't more
idiotic than what the kernel currently does.

Again, I find the "restrictions" unix has with respect to filenames
very sane and useful; if anything, the problem is usually that filenames
can contain unusual characters like \n.

> Choice is _inherently_ good. Trying to force a world-view is bad. You

Yes! I want the choice of having a kernel supporting UTF-8 (again, meaning
active support, i.e. not accepting malformed utf-8) :)

> UTF-8"), but you should not _force_ them to that if they have good reasons
> not to (and "backwards compatibility" is a better reason than just about
> anything else).

Well, being able (having the choice) to force it is good, just like
having the nosuid mount option is good, because it allows admins the
choice.

> No. I'm saying that
> (a) "if you want to use complex character sets"
> then
> (b) "you really have to use UTF-8"
> to talk to the kernel.
>
> Note the two parts. You're hung up on (b), while I have tried to make it
> clear that (a) is a prerequisite for (b).

Ok, this is true.

> (In all fairness, some people will disagree with (b) even when (a) is true
> and like things like UCS-2. Those people are crazy, but I guess I'd just
> mention that possibility anyway).

You have to admit, however, that, apart from UCS-2 being obvious insanity
as opposed to UTF-16 or some 32-bit encoding, some people have a point
in asking for these.

It is IMHO not useful to support this, simply because the posix API
cannot be made to deal with it. But I wouldn't rule out encodings like
UTF-32 as simply being crazy.

> This may change, btw. I'm nothing if not pragmatic. In another twenty
> years, maybe everybody _literally_ uses complex character sets, and this
> whole discussion is totally silly, and the kernel may enforce UTF-8 or
> Klingon or whatever. At some point assumptions become _so_ ingrained that
> they are no longer policy any more, they are just "fact".

True. Thanks a lot for explaining your arguments in such detail. In
fact, I can accept most if not all of your arguments, but I still think
it would be nice to have this extra functionality.

Arguments like "it's a pain to implement" (which I don't think it is,
but you are clearly better at judging that!) carry even more weight with
me.

So even though I think it's a good idea, it might never be implemented
for purely practical reasons.

(end of discussion, I think, for me at least)

--
-----==- |
----==-- _ |
---==---(_)__ __ ____ __ Marc Lehmann +--
--==---/ / _ \/ // /\ \/ / pcg@xxxxxxxx |e|
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE --+
The choice of a GNU generation |
|