Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

From: Andrea Arcangeli
Date: Tue Dec 22 2020 - 17:04:28 EST


On Tue, Dec 22, 2020 at 02:14:41PM -0700, Yu Zhao wrote:
> This works but I don't prefer this option because 1) this is new
> way of making pte_wrprotect safe and 2) it's specific to ufd and
> can't be applied to clear_soft_dirty() which has no "catcher". No

I didn't look into clear_soft_dirty issue, I can look into that.

To avoid having to deal with a 3rd solution it will have to use one of
the two:

1) avoid deferring tlb flush and enforce a sync flush before dropping
the PT lock even if mm_mm_tlb_flush_pending is true ->
write_protect_page in KSM

2) add its own new catcher for its own specific marker (for UFFD-WP
the marker is _PAGE_UFFD_WP, for change_prot_numa is PROT_NONE on a
vma->vm_pgprot not PROT_NONE, for soft dirty it could be
_PAGE_SOFT_DIRTY) to send the page fault to a dead end before the
pte value is interpreted.

> matter how good the documentation about this new way is now, it
> will be out of date, speaking from my personal experience.

A common catcher for all 3 is not possible because each catcher
depends on whatever marker and whatever pte value they set that may
lead to a different deterministic path where to put the catcher or
multiple paths even. do_numa_page requires a catcher in a different
place already.

Or well, a common catcher for all 3 is technically possible but it'd
perform slower requiring to check things twice.

But perhaps the soft_dirty can use the same catcher of uffd-wp given
the similarity?

> I'd go with what Nadav has -- the memory corruption problem has been
> there for a while and nobody has complained except Nadav. This makes
> me think people is less likely to notice any performance issues from
> holding mmap_lock for write in his patch either.

uffd-wp is a fairly new feature, the current users are irrelevant,
keeping it optimal is just for the future potential.

> But I can't say I have zero concern with the potential performance
> impact given that I have been expecting the fix to go to stable,
> which I care most. So the next option on my list is to have a

Actually stable would be very fine to go with Nadav patch and use the
mmap_lock_write unconditionally. The optimal fix is only relevant for
the future potential, so it's only relevant for Linus's tree.

However the feature is recent enough that it won't require a deep
backport so the optimal fix is likely fine for stable as well,
generally stable prefers the same fix as in the upstream when there's
no major backport rejection issue.

The alternative solution for uffd is to do the deferred flush under
mmap_lock_write if len is > HPAGE_PMD_SIZE, or to tell
change_protection not to defer the flush and to take the
mmap_lock_read for <= HPAGE_PMD_SIZE. That will avoid depending on the
catcher and then userfaultfd_writeprotect(mode_wp=true)
userfaultfd_writeprotect(mode_wp=false) can even run in parallel at
all times. The cons is large userfaultfd_writeprotect will block for
I/O and those would happen at least in the postcopy live snapshotting
use case.

The main cons is that it'd require modification to change_protection
so it actually looks more intrusive, not less.

Overall anything that allows to wrprotect 1 pte with only the
mmap_lock_read exactly like KSM write_protect_page, would be enough for
uffd-wp.

What isn't ok in terms of future potential is unconditional
mmap_lock_write as in the original suggested patch in my view. It
doesn't mean we can take mmap_lock_write when the operation is large
and there is actually more benefit from deferring the flush.

> common "catcher" in do_wp_page() which singles out pages that have
> page_mapcount equal to one and reuse them instead of trying to

I don't think the page_mapcount matters here. If the wp page reuse was
more accurate (as it was before) we wouldn't notice this issue, but it
still would be a bug that there were stale TLB entries. It worked by
luck.

Thanks,
Andrea