Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

From: Andrea Arcangeli
Date: Wed Dec 23 2020 - 18:06:51 EST


On Wed, Dec 23, 2020 at 03:29:51PM -0700, Yu Zhao wrote:
> I was hesitant to suggest the following because it isn't that straight
> forward. But since you seem to be less concerned with the complexity,
> I'll just bring it on the table -- it would take care of both ufd and
> clear_refs_write, wouldn't it?

It certainly would since this is basically declaring "leaving stale
TLB entries past mmap_read_lock" is now permitted as long as the
pending flush counter is elevated under mmap_sem for reading.

Anything that prevents uffd-wp to take mmap_write_lock looks great to
me, anything, the below included, as long as it looks like easy to
enforce and understand. And the below certainly is.

My view is that the below is at the very least an improvement in terms
of total complexity, compared to v5.10. At least it'll be documented.

So what would be not ok to me is to depend on undocumented not
guaranteed behavior in do_wp_page like the page_mapcount, which is
what we had until now in clear_refs_write and in uffd-wp (but only if
wrprotect raced against un-wrprotect, a tiny window if compared to
clear_refs_write).

Documenting that clearing pte_write and deferring the flush is allowed
if mm_tlb_flush_pending was elevated before taking the PT lock is less
complex and very well defined rule, if compared to what we had before
in the page_mapcount dependency of clear_refs_write which was prone to
break, and in fact it just did.

We'll need a commentary in both userfaultfd_writeprotect and
clear_refs_write that links to the below snippet.

If we abstract it in a header we can hide there also a #ifdef
CONFIG_HAVE_ARCH_SOFT_DIRTY=y && CONFIG_HAVE_ARCH_USERFAULTFD_WP=y &&
CONFIG_USERFAULTFD=y to make it even more explicit.

However it may be simpler to keep it unconditional, I don't mind
either ways. If it was up to me I'd write it to those 3 config options
to be sure I remember where it comes from and to force any other user
to register to be explicit they depend on that.

>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5e9ca612d7d7..af38c5ee327e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> goto unlock;
> }
> if (vmf->flags & FAULT_FLAG_WRITE) {
> - if (!pte_write(entry))
> + if (!pte_write(entry)) {
> + if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> + flush_tlb_page(vmf->vma, vmf->address);
> return do_wp_page(vmf);
> + }
> entry = pte_mkdirty(entry);
> }
> entry = pte_mkyoung(entry);
>