Re: UTF-8 practically vs. theoretically in the VFS API

From: H. Peter Anvin
Date: Tue Feb 17 2004 - 22:24:41 EST


Linus Torvalds wrote:
>
> On Wed, 18 Feb 2004, H. Peter Anvin wrote:
>
>>Those of us who have been involved with the issue have fought
>>*extremely* hard against DWIM decoders which try to decode the latter
>>sequences into ".." -- it's incorrect, and a security hazard. The
>>only acceptable decodings is to throw an error, or use an out-of-band
>>encoding mechanism to denote "bad bytecode."
>
> Somebody correctly pointed out that you do not need any out-of-band
> encoding mechanism - the very fact that it's an invalid sequence is in
> itself a perfectly fine flag. No out-of-band signalling required.
>
> The only thing you should make sure of is to not try to normalize it (that
> would hide the error). Just keep carrying the bad sequence along, and
> everybody is happy. Including the filesystem functions that get the "bad"
> name and match it exactly to what it should be matched against.
>

Well, the reason you'd want an out-of-band mechanism is to be able to
display it as some kind of escapes. Consider a UTF-8 decoder which uses
values in the 0x800000xx range to encode "bogus bytes"; that way it
wouldn't alias to anything else, but the bogus sequence "C0 AE" could be
represented as 0x800000C0 0x800000AE and displayed to the user as
\xC0\xAE\xC0\xAE ... which is different from \u00C0\u00AE ("À®", C3 80
C2 AE). This would make it possible to figure out in, for example, an
ls listing, what those broken filenames are actually composed of.

There are some advantages to being able to represent all possible byte
sequences and present them to the user, even if they're bogus.

-hpa

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/