Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

From: Yu Zhao
Date: Sun Dec 20 2020 - 23:45:26 EST


On Sun, Dec 20, 2020 at 07:33:09PM -0800, Nadav Amit wrote:
> > On Dec 20, 2020, at 1:54 AM, Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> >
> > On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote:
> >>> On Dec 19, 2020, at 10:05 PM, Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> >>>
> >>> On Sat, Dec 19, 2020 at 01:34:29PM -0800, Nadav Amit wrote:
> >>>> [ cc’ing some more people who have experience with similar problems ]
> >>>>
> >>>>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> >>>>>> Analyzing this problem indicates that there is a real bug since
> >>>>>> mmap_lock is only taken for read in mwriteprotect_range(). This might
> >>>>>
> >>>>> Never having to take the mmap_sem for writing, and in turn never
> >>>>> blocking, in order to modify the pagetables is quite an important
> >>>>> feature in uffd that justifies uffd instead of mprotect. It's not the
> >>>>> most important reason to use uffd, but it'd be nice if that guarantee
> >>>>> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> >>>>> other pgtable manipulations.
> >>>>>
> >>>>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> >>>>>>
> >>>>>> cpu0 cpu1
> >>>>>> ---- ----
> >>>>>> userfaultfd_writeprotect()
> >>>>>> [ write-protecting ]
> >>>>>> mwriteprotect_range()
> >>>>>> mmap_read_lock()
> >>>>>> change_protection()
> >>>>>> change_protection_range()
> >>>>>> ...
> >>>>>> change_pte_range()
> >>>>>> [ defer TLB flushes]
> >>>>>> userfaultfd_writeprotect()
> >>>>>> mmap_read_lock()
> >>>>>> change_protection()
> >>>>>> [ write-unprotect ]
> >>>>>> ...
> >>>>>> [ unprotect PTE logically ]
> >>>>>> ...
> >>>>>> [ page-fault]
> >>>>>> ...
> >>>>>> wp_page_copy()
> >>>>>> [ set new writable page in PTE]
> >>>
> >>> I don't see any problem in this example -- wp_page_copy() calls
> >>> ptep_clear_flush_notify(), which should take care of the stale entry
> >>> left by cpu0.
> >>>
> >>> That being said, I suspect the memory corruption you observed is
> >>> related this example, with cpu1 running something else that flushes
> >>> conditionally depending on pte_write().
> >>>
> >>> Do you know which type of pages were corrupted? file, anon, etc.
> >>
> >> First, Yu, you are correct. My analysis is incorrect, but let me have
> >> another try (below). To answer your (and Andrea’s) question - this happens
> >> with upstream without any changes, excluding a small fix to the selftest,
> >> since it failed (got stuck) due to missing wake events. [1]
> >>
> >> We are talking about anon memory.
> >>
> >> So to correct myself, I think that what I really encountered was actually
> >> during MM_CP_UFFD_WP_RESOLVE (i.e., when the protection is removed). The
> >> problem was that in this case the “write”-bit was removed during unprotect.
> >
> > Thanks. You are right about when the problem happens: UFD write-
> > UNprotecting. But it's not UFD write-UNprotecting that removes the
> > writable bit -- the bit can only be removed during COW or UFD
> > write-protecting. So your original example was almost correct, except
> > the last line describing cpu1.
>
> The scenario is a bit confusing, so stay with me. The idea behind uffd
> unprotect is indeed only to mark the PTE logically as uffd-unprotected, and
> not to *set* the writable bit, allowing the #PF handler to do COW or
> whatever correctly upon #PF.

Right.

> However, the problem that we have is that if a page is already writable,
> write-unprotect *clears* the writable bit, making it write-protected (at
> least for anonymous pages). This is not good from performance point-of-view,
> but also a correctness issue, as I pointed out.
>
> In some more detail: mwriteprotect_range() uses vm_get_page_prot() to
> compute the new protection. For anonymous private memory, at least on x86,
> this means the write-bit in the protection is clear. So later,
> change_pte_range() *clears* the write-bit during *unprotection*.

Agreed.

> That’s the reason the second part of my patch - the change to preserve_write
> - fixes the problem.

Yes, it fixes a part of the problem.

But what if there is no writable bit there for you to preserve? If the
bit was cleared by COW, e.g., KSM, fork, etc., no problem, because they
guarantee the flush. If it was cleared by a priory UFD write-protecting,
you still would run into the same problem because of the deterred flush.
And you are trying to fix this part of the problem by grabbing
mmap_write_lock. I think I understand your patch correctly.

Both parts of the problem were introduced by the commits I listed, and
your patch does fix the problem. I'm just saying it's not an optimal fix
because for the second part of the problem:
1) there is no need to grab mmap_write_lock
2) there is no need to make a copy in do_wp_page() if the write bit was
cleared by UFD write-protecting (non-COW case).

The fix should be done in do_wp_page(), i.e., to handle non-COW pages
correctly. Preserving the write bit can be considered separately as an
optimization, not a fix. (It eliminates unnecessary page faults.)

> > The problem is how do_wp_page() handles non-COW pages. (For COW pages,
> > do_wp_page() works correctly by either reusing an existing page or
> > make a new copy out of it.) In UFD case, the existing page may not
> > have been properly write-protected. As you pointed out, the tlb flush
> > may not be done yet. Making a copy can potentially race with the
> > writer on cpu2.
>
> Just to clarify the difference - You regard a scenario of UFFD
> write-protect, while I am pretty sure the problem I encountered is during
> write-unprotect.
>
> I am not sure we are on the same page (but we may be). The problem I have is
> with cow_user_page() that is called by do_wp_page() before any TLB flush
> took place (either by change_protection_range() or by do_wp_page() which
> does flush, but after the copy).
>
> Let me know if you regard a different scenario.
>
> > Should we fix the problem by ensuring integrity of the copy? IMO, no,
> > because do_wp_page() shouldn't copy at all in this case. It seems it
> > was recently broken by
> >
> > be068f29034f mm: fix misplaced unlock_page in do_wp_page()
> > 09854ba94c6a mm: do_wp_page() simplification
> >
> > I haven't study them carefully. But if you could just revert them and
> > run the test again, we'd know where exactly to look at next.
>
> These patches regard the wp_page_reuse() case, which makes me think we
> are not on the same page. I do not see a problem with wp_page_reuse()
> since it does not make a copy of the page. If you can explain what I
> am missing, it would be great.
>