Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

From: Nadav Amit
Date: Tue Dec 22 2020 - 16:15:44 EST


> On Dec 22, 2020, at 11:31 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>
> From 4ace4d1b53f5cb3b22a5c2dc33facc4150b112d6 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Date: Tue, 22 Dec 2020 14:30:16 -0500
> Subject: [PATCH 1/1] mm: userfaultfd: avoid leaving stale TLB after
> userfaultfd_writeprotect()
>
> change_protection() is called by userfaultfd_writeprotect() with the
> mmap_lock_read like in change_prot_numa().
>
> The page fault code in wp_copy_page() rightfully assumes if the CPU
> issued a write fault and the write bit in the pagetable is not set, no
> CPU can write to the page. That's wrong assumption after
> userfaultfd_writeprotect(). That's also wrong assumption after
> change_prot_numa() where the regular page fault code also would assume
> that if the present bit is not set and the page fault is running,
> there should be no stale TLB entry, but there is still.
>
> So to stay safe, the page fault code must be prevented to run as long
> as long as the TLB flush remains pending. That is already achieved by
> the do_numa_page() path for change_prot_numa() and by the
> userfaultfd_pte_wp() path for userfaultfd_writeprotect().
>
> The problem that needs fixing is that an un-wrprotect
> (i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP not
> set) could run in between the original wrprotect
> (i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP set)
> and wp_copy_page, while the TLB flush remains pending.

I may need to read your patch more carefully, but in general I do not like
the approach. You are much more experienced than I am, but IMHO the TLB
flushing logic needs to be further simplified and generalized and not the
other way around.

The complexity is already too high. We have tlb_flush_batched and
tlb_flush_pending, which I think should be (somehow) combined.
tlb_gather_mmu() is designed for zapping, but can’t it be modified to suit
protection changes to avoid buggy use-cases (as the wrong use in
clear_refs_write() ) ?

So adding new userfaultfd specific code, which potentially does not address
all the interactions (now or the future), is concerning.

In this regard, a similar problem to the one in userfaultfd
(mmap_read_lock() while write-protecting) already exists with soft-dirty
clearing, so any solution should also address the soft-dirty issue.