Re: [PATCH v3 0/2] iov_iter: allow iov_iter_get_pages_alloc to allocate more pages per call

From: Al Viro
Date: Sun Feb 05 2017 - 17:05:03 EST


On Sun, Feb 05, 2017 at 10:19:20PM +0100, Miklos Szeredi wrote:

> Then we can't break out of that deadlock: we wait until
> fuse_dev_do_write() is done before calling request_end(), which
> ultimately results in unlocking the page. But fuse_dev_do_write() won't
> complete until the page is unlocked.

Wait a sec. What happens if

process A:  fuse_lookup()
                struct fuse_entry_out outarg on stack
                ...
                fuse_request_send() with req->out.args[0].value = &outarg
                    sleep in request_wait_answer() on req->waitq
server:     read the request, write reply
                fuse_dev_do_write()
                    copy_out_args()
                        fuse_copy_args()
                            fuse_copy_one()
                                FR_LOCKED is guaranteed to be set
                                fuse_copy_do()
process C on another CPU: umount -f
                fuse_conn_abort()
                    end_requests()
                        request_end()
                            set FR_FINISHED
                            wake A up (via req->waitq)
process A:  regain CPU
                bugger off from request_wait_answer(), through __fuse_request_send(),
                fuse_request_send(), fuse_simple_request(), fuse_lookup_name(),
                fuse_lookup() and out of fuse_lookup().

In the meantime, the server in fuse_copy_do() does memcpy() to what used to
be outarg, corrupting the stack of process A.
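
The interleaving above can be replayed in a toy userspace model. This is only
a sketch, not kernel code: a static byte array stands in for A's kernel stack,
and struct race_req, toy_stack and toy_replay_abort_race() are hypothetical
names made up for the illustration. The steps run in one thread, in the order
the race would produce them, so the outcome is deterministic.

```c
#include <assert.h>
#include <string.h>

/* Toy model only: a byte array stands in for process A's kernel stack. */
static unsigned char toy_stack[64];

/* Hypothetical stand-in for the few request fields that matter here. */
struct race_req {
    void *out_value;    /* models req->out.args[0].value */
    size_t out_size;
    int finished;       /* models FR_FINISHED */
};

/* Returns the first byte of the reused stack slot after the late copy. */
static unsigned char toy_replay_abort_race(void)
{
    unsigned char reply[8];
    struct race_req req;

    memset(reply, 0xAA, sizeof(reply));     /* server's reply payload */

    /* A: outarg occupies 8 bytes at offset 16 of A's stack. */
    req.out_value = &toy_stack[16];
    req.out_size = sizeof(reply);
    req.finished = 0;
    assert(16 + req.out_size <= sizeof(toy_stack));

    /* C: the abort ends the request and wakes A while the server is mid-copy. */
    req.finished = 1;

    /* A: returns; a later frame reuses the same bytes for its own local. */
    memset(&toy_stack[16], 0x55, 8);

    /* server: the delayed memcpy() still targets the old outarg address. */
    memcpy(req.out_value, reply, req.out_size);

    return toy_stack[16];
}
```

Calling toy_replay_abort_race() returns 0xAA rather than the 0x55 that the
later frame stored: the late copy has overwritten a slot that no longer
belongs to outarg.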

Sure, you need to hit a fairly narrow window, especially if you are to
cause damage in A, but AFAICS it's not impossible. Consider e.g. the
situation when you lose CPU on preempt on the way to memcpy(); in that
case server might come back when A has incremented its stack footprint
again. Or A might end up taking a hardware interrupt and handling it
on the normal kernel stack, etc.

Looks like *any* scenario where fuse_conn_abort() manages to run during
that memcpy() has potential for that kind of trouble; any SMP box appears
to be vulnerable, along with preempt UP...

Am I missing something that prevents that kind of problem?

> The only way out that I see is to have a refcount on all pages in
> args. Which means copying everything not already in refcountable page
> (i.e. args on stack) to a page array. It's definitely doable, but
> needs time to sort out, and I'm definitely lacking that (overlayfs
> currently trumps fuse).

Hrm... Then maybe I'll have to try and cook something along those lines;
AFAICS the current mainline is vulnerable...
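
A rough userspace sketch of the refcounting idea quoted above, under heavy
assumptions: the reply lands in a heap buffer (standing in for a refcounted
page) instead of a stack variable, and both the submitter and the server hold
a reference on the request. struct pinned_req and the toy_* helpers are
hypothetical and do not correspond to anything in fs/fuse; the race is again
replayed in a single thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model; none of these names exist in fs/fuse. */
struct pinned_req {
    atomic_int refs;        /* one reference per party still using the request */
    int finished;           /* models FR_FINISHED */
    size_t out_size;
    unsigned char *out_buf; /* heap copy standing in for a refcounted page */
};

static struct pinned_req *toy_req_alloc(size_t out_size)
{
    struct pinned_req *req = calloc(1, sizeof(*req));

    if (!req)
        return NULL;
    req->out_buf = calloc(1, out_size);
    if (!req->out_buf) {
        free(req);
        return NULL;
    }
    req->out_size = out_size;
    atomic_init(&req->refs, 1);     /* the submitter's reference */
    return req;
}

static void toy_req_get(struct pinned_req *req)
{
    atomic_fetch_add(&req->refs, 1);
}

/* Returns 1 if this call dropped the last reference and freed the request. */
static int toy_req_put(struct pinned_req *req)
{
    if (atomic_fetch_sub(&req->refs, 1) != 1)
        return 0;
    free(req->out_buf);
    free(req);
    return 1;
}

/*
 * Replays the abort interleaving. Returns 1 if the submitter's on-stack
 * outarg stayed untouched and the request was freed only by the last put.
 */
static int toy_abort_with_refcount(void)
{
    unsigned char outarg[8] = {0};  /* submitter's stack variable */
    const unsigned char reply[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    struct pinned_req *req = toy_req_alloc(sizeof(outarg));
    int freed_early, freed_late;
    size_t i;

    assert(req != NULL);
    toy_req_get(req);               /* server pins the request before copying */
    req->finished = 1;              /* abort: request ended, submitter woken */

    /* submitter: aborted, so it skips the copy-out and drops its reference */
    freed_early = toy_req_put(req);

    /* server: the late copy lands in the still-live heap buffer */
    memcpy(req->out_buf, reply, req->out_size);
    freed_late = toy_req_put(req);

    for (i = 0; i < sizeof(outarg); i++)
        if (outarg[i] != 0)
            return 0;
    return !freed_early && freed_late;
}
```

toy_abort_with_refcount() returns 1 in this model: the submitter's early
return frees nothing, and the late memcpy() writes into memory that the
server's own reference keeps alive. The cost is the extra copy of every
on-stack argument, which is the part that needs sorting out.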