Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)

From: Jamie Lokier
Date: Tue Feb 17 2004 - 15:33:44 EST


Linus Torvalds wrote:
> And it would be _trivial_ to add a LOOKUP_NODOTDOT and allow user space to
> use it through a O_NODOTDOT thing.

Nope. That wouldn't help for a bundle of libraries that goes:

1. Eliminate "." and ".." components, leaving only leading ".."s.
2. Reject path if it has a leading "..".
3. Shove it in a string with some other text and pass to other library.

Next program does:

4. Extract path from string.
5. open ("/var/public/files/$PATH", ...)

O_NODOTDOT won't protect against that.

> Same goes for O_NOFOLLOW or O_NOMOUNT, to tell the kernel that it
> shouldn't follow symbolic links or cross mount-points - another thing that
> some software might want to use in order to check that you can't "escape"
> your subtree.

( O_NOMOUNT is a good idea. I like O_NOFOLLOW - already use it to
avoid lstat() calls. )

> But note how my point was that YOU SHOULD NEVER EVER MUNGE A PATHNAME!
>
> It is fundamentally _wrong_ to convert pathnames. You _cannot_ do it
> correctly.

I know. You know. I think everyone else got it the first time too ;)

Have you ever written a script which takes a pathname and puts it in a
text file, and passes to another to operate on, and just skipped over
the details of what a poorly placed control character would do?

Real applications do exactly that sort of thing. Mostly it works,
occasionally security holes are found. Welcome to the land of dodgy
CGI scripts and program generated Makefiles.

This is the same.

> - always _always_ work on the "extended UTF-8" format, and never EVER
> convert that to anything else (except when you need to actually print
> it, but then you encode it properly with escape sequences, the way you
> have to _anyway_).
>
> If you follow the above simple rules, you can't get it wrong. And in those
> rules, ".." is the BYTE SEQUENCE in the "extended UTF-8". Nothing more.

Yup. It works right up until you pass your string to a library which
doesn't follow that rule, and which munges malformed UTF-8 because
it's _expecting_ well formed UTF-8. E.g. you pass a path in an XML
document; the XML parser at the other end will either munge your path
(causing a security hole), or reject it (which is good).

The right thing to do on these occasions is check and/or escape
"extended UTF-8" prior to putting it into a text context.

Practically, that means a UTF-8 aware program has to keep track of
which text is "extended UTF-8" (i.e. bytes), and which text is real UTF-8.

Practically, it means every interface where a path may be passed in a
UTF-8 string has to define whether that's an escaped path, which will
be unescaped before being used for a system call, or an unescaped path.
Then you get into what kind of escaping.

In theory all those checks and escapings will be in the right places.
In theory C programs don't have buffer overflows either.

It is exasperated because UTF-8 is often passed through middle-layer
programs and libraries that don't know anything about it, so when
assembling a whole system it's all too easy to lose track of where to
put the checks and escapings - and where not to.

Yes there _is_ a perfectly fine solution: the one you gave.

In practice it is difficult to ensure a whole system where paths are
mixed with text is consistent about that. And that's where we get a
good selection of our Windows worms from.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/