Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

From: Andrea Arcangeli
Date: Tue Jan 05 2021 - 13:00:34 EST


On Tue, Jan 05, 2021 at 09:58:57AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
> > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> > >
> > > > The scenario that happens in selftests/vm/userfaultfd is as follows:
> > > >
> > > > cpu0 cpu1 cpu2
> > > > ---- ---- ----
> > > > [ Writable PTE
> > > > cached in TLB ]
> > > > userfaultfd_writeprotect()
> > > > [ write-*unprotect* ]
> > > > mwriteprotect_range()
> > > > mmap_read_lock()
> > > > change_protection()
> > > >
> > > > change_protection_range()
> > > > ...
> > > > change_pte_range()
> > > > [ *clear* “write”-bit ]
> > > > [ defer TLB flushes ]
> > > > [ page-fault ]
> > > > ...
> > > > wp_page_copy()
> > > > cow_user_page()
> > > > [ copy page ]
> > > > [ write to old
> > > > page ]
> > > > ...
> > > > set_pte_at_notify()
> > >
> > > Yuck!
> > >
> >
> > Note, the above was posted before we figured out the details so it
> > wasn't showing the real deferred tlb flush that caused problems (the
> > one showed on the left causes zero issues).
> >
> > The problematic one not pictured is the one of the wrprotect that has
> > to be running in another CPU which is also isn't picture above. More
> > accurate traces are posted later in the thread.
>
> Lets assume CPU0 does a read-lock, W -> RO with deferred flush.

I was mistaken saying the deferred tlb flush was not shown in the v2
trace, just this appears a new different case we didn't happen to
consider before.

In the previous case we discussed earlier, when un-wrprotect above is
called it never should have been a W->RO since a wrprotect run first.

Doesn't it ring a bell that if an un-wrprotect does a W->RO
transition, something is a bit going backwards?

I don't recall from previous discussion that un-wrprotect was
considered as called on read-write memory. I think we need the below
change to fix this new case.

if (uffd_wp) {
+ if (unlikely(pte_uffd_wp(oldpte)))
+ continue;
ptent = pte_wrprotect(ptent);
ptent = pte_mkuffd_wp(ptent);
} else if (uffd_wp_resolve) {
+ if (unlikely(!pte_uffd_wp(oldpte)))
+ continue;
/*
* Leave the write bit to be handled
* by PF interrupt handler, then
* things like COW could be properly
* handled.
*/
ptent = pte_clear_uffd_wp(ptent);
}

I now get why the v2 patch touches preserved_write, but this is not
about preserve_write, it's not about leaving the write bit alone. This
is about leaving the whole pte alone if the uffd-wp bit doesn't
actually change.

We shouldn't just defer the tlb flush if un-wprotect is called on
read-write memory: we should not have flushed the tlb at all in such
case.

Same for hugepmd in huge_memory.c which will be somewhere else.

Once the above is optimized, then un-wrprotect as in
MM_CP_UFFD_WP_RESOLVE is usually preceded by wrprotect as in
MM_CP_UFFD_WP, and so it'll never be a W->RO but a RO->RO transition
that just clears the uffd_wp flag and nothing else and whose tlb flush
is in turn irrelevant.

The fix discussed still works for this new case too: I'm not
suggesting we should rely on the above optimization for the tlb
safety. The above is just a missing optimization.

> > > Isn't this all rather similar to the problem that resulted in the
> > > tlb_flush_pending mess?
> > >
> > > I still think that's all fundamentally buggered, the much saner solution
> > > (IMO) would've been to make things wait for the pending flush, instead
> >
> > How do intend you wait in PT lock while the writer also has to take PT
> > lock repeatedly before it can do wake_up_var?
> >
> > If you release the PT lock before calling wait_tlb_flush_pending it
> > all falls apart again.
>
> I suppose you can check for pending, if found, release lock, wait for 0,
> and re-take the fault?

Aborting the page fault unconditionally while MADV_DONTNEED is running
on some other unrelated vma, sounds not desirable.

Doing it only for !VM_SOFTDIRTY or soft dirty not compiled in sounds
less bad but it would still mean that while clear_refs is running, no
thread can write to any anon memory of the process.

> > This I guess explains why a local pte/hugepmd smp local invlpg is the
> > only working solution for this issue, similarly to how it's done in rmap.
>
> In that case a local invalidate on CPU1 simply doesn't help anything.
>
> CPU1 needs to do a global invalidate or wait for the in-progress one to
> complete, such that CPU2 is sure to not have a W entry left before CPU1
> goes and copies the page.

Yes, it was a global invlpg, definitely not local sorry for the
confusion, as in the PoC posted here which needs cleaning up:

https://lkml.kernel.org/r/X+QLr1WmGXMs33Ld@xxxxxxxxxx

+ flush_tlb_page(vma, vmf->address);

I think instead of the flush_tlb_page above, we just need an
ad-hoc abstraction there.

The added complexity to the page fault common code consist in having
to call such abstract call in the right place of the page fault.

The vm_flags to check will be the same for both the flush_tlb_page and
the wait_tlb_pending approaches.

Once the filter on vm_flags pass, the only difference is between
"flush_tlb_page; return void" or "PT unlock; wait_; return
VM_FAULT_RETRY" so it looks more an implementation detail with a
different tradeoff at runtime.

Thanks,
Andrea