Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

From: Linus Torvalds
Date: Wed Dec 23 2020 - 04:47:16 EST


On Tue, Dec 22, 2020 at 4:01 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> The more I look at the mprotect code, the less I like it. We seem to
> be much better about the TLB flushes in other places (looking at
> mremap, for example). The mprotect code seems to be very laissez-faire
> about the TLB flushing.

No, this doesn't help.

> Does adding a TLB flush to before that
>
> pte_unmap_unlock(pte - 1, ptl);
>
> fix things for you?

It really doesn't fix it. Exactly because - as pointed out earlier -
the actual page *copy* happens outside the pte lock.

So what can happen is:

- CPU 1 holds the page table lock, while doing the write protect. It
has cleared the writable bit, but hasn't flushed the TLB's yet

- CPU 2 did *not* have the TLB entry, sees the new read-only state,
takes a COW page fault, and reads the PTE from memory (into
vmf->orig_pte)

- CPU 2 correctly decides it needs to be a COW, and copies the page contents

- CPU 3 *does* have a stale TLB (because TLB invalidation hasn't
happened yet), and writes to that page in users apce

- CPU 1 now does the TLB invalidate, and releases the page table lock

- CPU 2 gets the page table lock, sees that its PTE matches
vmf->orig_pte, and switches it to be that writable copy of the page.

where the copy happened before CPU 3 had stopped writing to the page.

So the pte lock doesn't actually matter, unless we actually do the
page copy inside of it (on CPU2), in addition to doing the TLB flush
inside of it (on CPU1).

mprotect() is actually safe for two independent reasons: (a) it does
the mmap_sem for writing (so mprotect can't race with the COW logic at
all), and (b) it changes the vma permissions so turning something
read-only actually disables COW anyway, since it won't be a COW, it
will be a SIGSEGV.

So mprotect() is irrelevant, other than the fact that it shares some
code with that "turn it read-only in the page tables".

fork() is a much closer operation, in that it actually triggers that
COW behavior, but fork() takes the mmap_sem for writing, so it avoids
this too.

So it's really just userfaultfd and that kind of ilk that is relevant
here, I think. But that "you need to flush the TLB before releasing
the page table lock" was not true (well, it's true in other
circumstances - just not *here*), and is not part of the solution.

Or rather, if it's part of the solution here, it would have to be
matched with that "page copy needs to be done under the page table
lock too".

Linus