Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
From: Andrea Arcangeli
Date: Tue Jan 05 2021 - 13:47:21 EST
On Mon, Jan 04, 2021 at 09:26:33PM +0000, Nadav Amit wrote:
> I would feel more comfortable if you provide patches for uffd-wp. If you
> want, I will do it, but I restate that I do not feel comfortable with this
> solution (worried as it seems a bit ad-hoc and might leave out a scenario
> we all missed or cause a TLB shootdown storm).
>
> As for soft-dirty, I thought that you said that you do not see a better
> (backportable) solution for soft-dirty. Correct me if I am wrong.
I think they should use the same technique, since they deal with the
exact same challenge. I will try to clean up the patch in the meantime.
I can also try to do the additional cleanups to clear_refs to
eliminate the tlb_gather completely, since it doesn't gather any pages
and there is no point in using it.
> Anyhow, I will add your comments regarding the stale TLB window to make the
> description clearer.
Having the mmap_write_lock solution as a backup won't hurt, but I think
it's only plan B in case plan A doesn't work, and the only stable tree
that will have to apply this is v5.9.x. No earlier tree needs any
change in this respect, so there's no worry about rejects.
It worked by luck until Aug 2020, but it did so reliably or somebody
would have noticed already. And it wasn't exploitable either; it simply
worked stably, but it was prone to break if the kernel changed in some
other way, and it eventually did in Aug 2020, when an unrelated
patch changed the reuse logic.
If you want to maintain the mmap_write_lock patch, it'd be ideal if
you could drop the preserved_write and adjust the Fixes tag to target
Aug 2020. The uffd-wp needs a different optimization that maybe Peter is
already working on, or I can include it in the patchset for this, but
definitely in a separate commit because it's orthogonal.
It's great you noticed the W->RO transition of un-wprotect so we can
optimize that too (it will have a positive runtime effect, it's not
just theoretical, since it's normal to un-wprotect a huge range once
the postcopy snapshotting of the virtual machine is complete). I was
thinking of the previous case discussed in the other thread.
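For readers less familiar with the userland side, here is a minimal
sketch of what un-wprotecting a whole range in one call could look
like, assuming a kernel and headers that provide UFFDIO_WRITEPROTECT
and a uffd already registered on the range in write-protect mode. The
helper names make_wp_request() and uffd_unprotect_range() are
hypothetical, chosen only for illustration; they are not part of any
patch in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: fill in a write-protect request for one
 * contiguous range. protect != 0 sets the WP mode bit, protect == 0
 * asks the kernel to remove write protection from the range.
 */
static struct uffdio_writeprotect make_wp_request(void *start, size_t len,
                                                  int protect)
{
    struct uffdio_writeprotect wp;

    memset(&wp, 0, sizeof(wp));
    wp.range.start = (uint64_t)(uintptr_t)start;
    wp.range.len = len;
    wp.mode = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
    return wp;
}

/*
 * Hypothetical helper: drop write protection on the whole range with a
 * single ioctl, as a snapshotting tool might do once the snapshot is
 * complete. Returns 0 on success, -1 with errno set on failure.
 */
static int uffd_unprotect_range(int uffd, void *start, size_t len)
{
    struct uffdio_writeprotect wp = make_wp_request(start, len, 0);

    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}
```

A caller would pass the start and length of the guest memory region
here; the cost of that single large call is the case the optimization
discussed above is meant to help.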
I just don't like to slow down a feature required in the future for
implementing postcopy live snapshotting or other snapshots of userland
processes (for the non-KVM case, also unprivileged by default if using
bounce buffers to feed the syscalls) that can be used by open source
hypervisors to beat proprietary hypervisors like VMware.
The security concern that uffd-wp allows enlarging the window of
use-after-free kernel bugs is not as much of a concern here as it is
for regular processes. First, the jailer model can obtain the uffd
before dropping all caps and before firing up seccomp in the child, so
the superior and cleaner monolithic jailer model won't even require
lifting the unprivileged_userfaultfd restriction.
If the uffd and uffd-wp can only run in rust-vmm and qemu, that
userland is system software to be trusted as much as the kernel from
the guest's point of view. It's similar to fuse: if somebody gets into
the fuse process, they can also stop kernel-initiated faults. In that
respect fuse is also system software even though it runs in userland.
In other words, I think if there's a vm-escape that takes control of
rust-vmm userland, the last worry is the fact that it can stop
kernel-initiated page faults because the jailer took a uffd before
dropping privs.
Thanks,
Andrea