Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)

From: viro
Date: Tue Feb 17 2004 - 14:56:39 EST


On Tue, Feb 17, 2004 at 07:29:18PM +0000, Jamie Lokier wrote:
> What happens is that one program or library checks an incoming path
> for ".." components - that code knows nothing about UTF-8 of course.
>
> Then it passes the string to another program which assumes the path
> has been subject to appropriate security checks, munges it in UTF-8,
> and eventually does a file operation with it. The munging generates
> ".." components from non-minimal UTF-8 forms - if it's not obeying the
> Unicode rejection requirement (which wasn't in earlier versions), that is.

Why the hell would it _ever_ do such normalization?

> A realistic example is where the second program reads files whose
> paths are mentioned in a text file which is parsed as UTF-8, after the
> first program has done a security check by grepping for ".."
> components.
>
> Unicode says the second program shouldn't accept malformed UTF-8,
> precisely because in real scenarios (like this one) there's a mix of
> programs and libraries, some aware of UTF-8, some not, and the latter
> are involved in security decisions.
>
> Here on linux-kernel we're saying that if the second program accepts
> any old byte sequence in a filename, it should preserve the byte
> sequence exactly. But any program whose parser-tokeniser is scanning
> UTF-8 is very unlikely to do that - it's just too complicated to say
> some bits of a text stream must be remembered as literal bytes, and
> others must be scanned as multibyte characters.

So what you are saying is that conversion of invalid multibyte sequences
into non-error wide chars followed by conversion back into UTF-8 can
lead to trouble? *DUH*

> The holes only arise because software which is interpreting UTF-8 is
> mixed with software which isn't. That's one of the most useful
> features of UTF-8, after all - that's why we use it for filenames.

The holes only arise because software which is interpreting UTF-8 doesn't
care to do it properly. Software that doesn't interpret it (including the
kernel) doesn't enter the picture at all.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/