Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

From: Andrea Arcangeli
Date: Mon Jan 04 2021 - 16:03:30 EST


On Mon, Jan 04, 2021 at 08:39:37PM +0000, Nadav Amit wrote:
> > On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> >
> > On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
> >>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> >>>
> >>> Hello,
> >>>
> >>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> >>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> >>>>
> >>>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
> >>>>>
> >>>>> cpu0 cpu1 cpu2
> >>>>> ---- ---- ----
> >>>>> [ Writable PTE
> >>>>> cached in TLB ]
> >>>>> userfaultfd_writeprotect()
> >>>>> [ write-*unprotect* ]
> >>>>> mwriteprotect_range()
> >>>>> mmap_read_lock()
> >>>>> change_protection()
> >>>>>
> >>>>> change_protection_range()
> >>>>> ...
> >>>>> change_pte_range()
> >>>>> [ *clear* “write”-bit ]
> >>>>> [ defer TLB flushes ]
> >>>>> [ page-fault ]
> >>>>> ...
> >>>>> wp_page_copy()
> >>>>> cow_user_page()
> >>>>> [ copy page ]
> >>>>> [ write to old
> >>>>> page ]
> >>>>> ...
> >>>>> set_pte_at_notify()
> >>>>
> >>>> Yuck!
> >>>
> >>> Note, the above was posted before we figured out the details so it
> >>> wasn't showing the real deferred tlb flush that caused problems (the
> >>> one showed on the left causes zero issues).
> >>
> >> Actually it was posted after (note that this is v2). The aforementioned
> >> scenario that Peter regards to is the one that I actually encountered (not
> >> the second scenario that is “theoretical”). This scenario that Peter regards
> >> is indeed more “stupid” in the sense that we should just not write-protect
> >> the PTE on userfaultfd write-unprotect.
> >>
> >> Let me know if I made any mistake in the description.
> >
> > I didn't say there is a mistake. I said it is not showing the real
> > deferred tlb flush that cause problems.
> >
> > The issue here is that we have a "defer tlb flush" that runs after
> > "write to old page".
> >
> > If you look at the above, you're induced to think the "defer tlb
> > flush" that causes issues is the one in cpu0. It's not. That is
> > totally harmless.
>
> I do not understand what you say. The deferred TLB flush on cpu0 *is* the
> the one that causes the problem. The PTE is write-protected (although it is
> a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle
> the page-fault (and copy), while cpu2 keeps writing to the source page. If
> cpu0 did not defer the TLB flush, this problem would not happen.

Your argument "If cpu0 did not defer the TLB flush, this problem would
not happen" is identical to "if the cpu0 has a small TLB size and the
tlb entry is recycled, the problem would not happen".

There are a multitude of factors that are unrelated to the real
problematic deferred tlb flush that may happen and still won't cause
the issue, including a parallel IPI.

The point is that we don't need to worry about the "defer TLB flushes"
of the un-wrprotect, because you said earlier that deferring tlb
flushes when you're doing "permission promotions" does not cause
problems.

The only "deferred tlb flush" we need to worry about, not in the
picture, is the one following the actual permission removal (the
wrprotection).

> it shows the write that triggers the corruption instead of discussing
> “windows”, which might be less clear. Running copy_user_page() with stale

I think showing exactly where the race window opens is key to
understand the issue, but then that's the way I work and feel free to
think it in any other way that may sound simpler.

I just worried people thinks the deferred tlb flush in your v2 trace
is the one that causes problem when obviously it's not since it
follows a permission promotion. Once that is clear, feel free to
reject my trace.

All I care about is that performance don't regress from CPU-speed to
disk I/O spindle speed, for soft dirty and uffd-wp.

> I am not married to my description and if you (and others) insist I would
> copy-paste your version.

I definitely don't insist, I only wanted to clarify in case it may not
have been clear the problematic deferred tlb flush wasn't part of your
trace.

> Are you talking about the first scenario (write-unprotect), the second one
> (write-protect followed by write-unprotect), both? It seems to me that all
> the deferred TLB flushes are mentioned at the point they are deferred. I can
> add the point in which they are performed.

The only case that has an issue for uffd-wp is in my trace and only
ever happens if there's a wrprotect in flight, the deferred tlb flush
of the wrprotect is deferred (and that's the problematic one that
closes the window when it finally runs) and un-wrprotect runs. The
window opens when the un-wrprotect unlocks the PT lock. The deferred
tlb flush of un-wrprotect is as relevant for this race, as random tlb
flushes from IPI or the TLB being small or none.

Thanks,
Andrea