Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
From: Andrea Arcangeli
Date: Mon Jan 04 2021 - 14:26:37 EST
Hello,
On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>
> > The scenario that happens in selftests/vm/userfaultfd is as follows:
> >
> > cpu0                            cpu1                        cpu2
> > ----                            ----                        ----
> >                                                             [ Writable PTE
> >                                                               cached in TLB ]
> > userfaultfd_writeprotect()
> > [ write-*unprotect* ]
> > mwriteprotect_range()
> > mmap_read_lock()
> > change_protection()
> >
> > change_protection_range()
> > ...
> > change_pte_range()
> > [ *clear* “write”-bit ]
> > [ defer TLB flushes ]
> >                                 [ page-fault ]
> >                                 ...
> >                                 wp_page_copy()
> >                                  cow_user_page()
> >                                   [ copy page ]
> >                                                             [ write to old
> >                                                               page ]
> >                                 ...
> >                                 set_pte_at_notify()
>
> Yuck!
>
Note, the above was posted before we figured out the details, so it
wasn't showing the real deferred tlb flush that caused problems (the
one shown on the left causes zero issues).

The problematic flush not pictured is the one belonging to the
wrprotect, which has to be running on another CPU and which also
isn't pictured above. More accurate traces are posted later in the
thread.
> Isn't this all rather similar to the problem that resulted in the
> tlb_flush_pending mess?
>
> I still think that's all fundamentally buggered, the much saner solution
> (IMO) would've been to make things wait for the pending flush, instead
How do you intend to wait while holding the PT lock, when the writer
also has to take the PT lock repeatedly before it can do wake_up_var?
If you release the PT lock before calling wait_tlb_flush_pending, it
all falls apart again.
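To spell it out (a rough interleaving sketch, not real code; it
assumes the fault path is where wait_tlb_flush_pending would have to
be called):

	/* page fault side (sketch) */
	spin_lock(ptl);                 /* take the PT lock           */
	wait_tlb_flush_pending(mm);     /* sleep until the count is 0 */

	/* wrprotect side (sketch) */
	inc_tlb_flush_pending(mm);      /* count is now > 0           */
	spin_lock(ptl);                 /* blocks: the fault side is
					   sleeping with ptl held, so
					   dec_tlb_flush_pending() and
					   wake_up_var() can never run
					   and nobody makes progress */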
This I guess explains why a granular, smp-local invlpg of the single
pte/hugepmd is the only working solution for this issue, similar to
how it's done in rmap.
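Concretely, something along these lines in the wrprotect fault path
(an untested sketch, and the exact condition and call site would need
care; the point is only the granular per-address flush done while the
PT lock is held):

	/* in do_wp_page()-like code, with the PT lock held (sketch) */
	if (mm_tlb_flush_pending(vmf->vma->vm_mm))
		/* zap the single potentially stale TLB entry */
		flush_tlb_page(vmf->vma, vmf->address);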
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 07d9acb5b19c..0210547ac424 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -649,7 +649,8 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
> *
> * Therefore we must rely on tlb_flush_*() to guarantee order.
> */
> - atomic_dec(&mm->tlb_flush_pending);
> + if (atomic_dec_and_test(&mm->tlb_flush_pending))
> + wake_up_var(&mm->tlb_flush_pending);
> }
>
> static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> @@ -677,6 +678,12 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
> return atomic_read(&mm->tlb_flush_pending) > 1;
> }
>
> +static inline void wait_tlb_flush_pending(struct mm_struct *mm)
> +{
> + wait_var_event(&mm->tlb_flush_pending,
> + atomic_read(&mm->tlb_flush_pending) == 0);
> +}
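For context on the quoted diff: in the path at issue the pending
count is driven directly by change_protection_range(), roughly like
this (a sketch of the existing mprotect code from memory, not part of
the patch):

	inc_tlb_flush_pending(mm);
	/* ... walk the range and clear the write bit on the ptes ... */
	if (pages != 0)
		flush_tlb_range(vma, start, end);
	dec_tlb_flush_pending(mm);	/* with the diff above, this now
					   also does wake_up_var()     */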
I appreciate the effort to avoid regressing soft dirty and uffd-wp
writeprotect to disk-I/O spindle bandwidth, and to avoid taking
mmap_sem for writing.
At the same time, what was posted so far wasn't clean enough, and it
wasn't even tested... if we abstract it in some clean way and we mark
all the connected points (soft dirty, uffd-wp, the wrprotect page
fault), then I can be optimistic it will remain understandable when
we look at it again a few years down the road.

Or at the very least it can't get worse than the "tlb_flush_pending
mess" you mentioned above.
flush_tlb_batched_pending() has to be orthogonally re-reviewed for
those things Nadav pointed out, but I'd rather keep that review in a
separate thread since any bug in that code has zero connection to
this issue. The basic idea is similar but the methods and logic are
different: our flush here will be granular, and it will only run if
VM_SOFTDIRTY isn't set and soft dirty is compiled in, or if
VM_UFFD_WP is set. flush_tlb_batched_pending() is mm-wide,
unconditional, etc. Pretty much all different.
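To illustrate the shape of that check (hypothetical helper, the name
and placement are mine, purely for illustration):

	/* hypothetical, for illustration only */
	static inline bool wrprotect_flush_needed(struct vm_area_struct *vma)
	{
		if (!mm_tlb_flush_pending(vma->vm_mm))
			return false;
		return (vma->vm_flags & VM_UFFD_WP) ||
		       (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) &&
			!(vma->vm_flags & VM_SOFTDIRTY));
	}

as opposed to flush_tlb_batched_pending(), which just does an
unconditional mm-wide flush_tlb_mm() when its flag is set.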
Thanks,
Andrea