Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

From: Andrea Arcangeli
Date: Mon Jan 04 2021 - 15:21:31 EST


On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
> > On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> >
> > Hello,
> >
> > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> >> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> >>
> >>> The scenario that happens in selftests/vm/userfaultfd is as follows:
> >>>
> >>> cpu0 cpu1 cpu2
> >>> ---- ---- ----
> >>> [ Writable PTE
> >>> cached in TLB ]
> >>> userfaultfd_writeprotect()
> >>> [ write-*unprotect* ]
> >>> mwriteprotect_range()
> >>> mmap_read_lock()
> >>> change_protection()
> >>>
> >>> change_protection_range()
> >>> ...
> >>> change_pte_range()
> >>> [ *clear* “write”-bit ]
> >>> [ defer TLB flushes ]
> >>> [ page-fault ]
> >>> ...
> >>> wp_page_copy()
> >>> cow_user_page()
> >>> [ copy page ]
> >>> [ write to old
> >>> page ]
> >>> ...
> >>> set_pte_at_notify()
> >>
> >> Yuck!
> >
> > Note, the above was posted before we figured out the details so it
> > wasn't showing the real deferred tlb flush that caused problems (the
> > one showed on the left causes zero issues).
>
> Actually it was posted after (note that this is v2). The aforementioned
> scenario that Peter regards to is the one that I actually encountered (not
> the second scenario that is “theoretical”). This scenario that Peter regards
> is indeed more “stupid” in the sense that we should just not write-protect
> the PTE on userfaultfd write-unprotect.
>
> Let me know if I made any mistake in the description.

I didn't say there is a mistake. I said it is not showing the real
deferred tlb flush that cause problems.

The issue here is that we have a "defer tlb flush" that runs after
"write to old page".

If you look at the above, you're induced to think the "defer tlb
flush" that causes issues is the one in cpu0. It's not. That is
totally harmless.


>
> > The problematic one not pictured is the one of the wrprotect that has
> > to be running in another CPU which is also isn't picture above. More
> > accurate traces are posted later in the thread.
>
> I think I included this scenario as well in the commit log (of v2). Let me
> know if I screwed up and the description is not clear.

Instead of not showing the real "defer tlb flush" in the trace and
then fixing it up in the comment, why don't you take the trace showing
the real problematic "defer tlb flush"? No need to reinvent it.

https://lkml.kernel.org/r/X+JJqK91plkBVisG@xxxxxxxxxx

See here the detail underlined:

deferred tlb flush <- too late
XXXXXXXXXXXXXX BUG RACE window close here

This show the real deferred tlb flush, your v2 does not include it
instead.