Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

From: Nadav Amit
Date: Mon Jan 04 2021 - 16:27:29 EST


> On Jan 4, 2021, at 1:01 PM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>
> On Mon, Jan 04, 2021 at 08:39:37PM +0000, Nadav Amit wrote:
>>> On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>>>
>>> On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
>>>>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
>>>>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>>>>>>
>>>>>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
>>>>>>>
>>>>>>> cpu0                          cpu1                    cpu2
>>>>>>> ----                          ----                    ----
>>>>>>>                                                       [ Writable PTE
>>>>>>>                                                         cached in TLB ]
>>>>>>> userfaultfd_writeprotect()
>>>>>>> [ write-*unprotect* ]
>>>>>>> mwriteprotect_range()
>>>>>>> mmap_read_lock()
>>>>>>> change_protection()
>>>>>>>
>>>>>>> change_protection_range()
>>>>>>> ...
>>>>>>> change_pte_range()
>>>>>>> [ *clear* “write”-bit ]
>>>>>>> [ defer TLB flushes ]
>>>>>>>                               [ page-fault ]
>>>>>>>                               ...
>>>>>>>                               wp_page_copy()
>>>>>>>                                cow_user_page()
>>>>>>>                                 [ copy page ]
>>>>>>>                                                       [ write to old
>>>>>>>                                                         page ]
>>>>>>>                               ...
>>>>>>>                                set_pte_at_notify()
>>>>>>
>>>>>> Yuck!
>>>>>
>>>>> Note, the above was posted before we figured out the details so it
>>>>> wasn't showing the real deferred tlb flush that caused problems (the
>>>>> one shown on the left causes zero issues).
>>>>
>>>> Actually it was posted after (note that this is v2). The aforementioned
>>>> scenario that Peter refers to is the one that I actually encountered (not
>>>> the second scenario that is “theoretical”). This scenario that Peter refers
>>>> to is indeed more “stupid” in the sense that we should just not write-protect
>>>> the PTE on userfaultfd write-unprotect.
>>>>
>>>> Let me know if I made any mistake in the description.
>>>
>>> I didn't say there is a mistake. I said it is not showing the real
>>> deferred tlb flush that causes problems.
>>>
>>> The issue here is that we have a "defer tlb flush" that runs after
>>> "write to old page".
>>>
>>> If you look at the above, you're induced to think the "defer tlb
>>> flush" that causes issues is the one in cpu0. It's not. That is
>>> totally harmless.
>>
>> I do not understand what you say. The deferred TLB flush on cpu0 *is* the
>> one that causes the problem. The PTE is write-protected (although it is
>> a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle
>> the page-fault (and copy), while cpu2 keeps writing to the source page. If
>> cpu0 did not defer the TLB flush, this problem would not happen.
>
> Your argument "If cpu0 did not defer the TLB flush, this problem would
> not happen" is identical to "if the cpu0 has a small TLB size and the
> tlb entry is recycled, the problem would not happen".
>
> There are a multitude of factors that are unrelated to the real
> problematic deferred tlb flush that may happen and still won't cause
> the issue, including a parallel IPI.
>
> The point is that we don't need to worry about the "defer TLB flushes"
> of the un-wrprotect, because you said earlier that deferring tlb
> flushes when you're doing "permission promotions" does not cause
> problems.
>
> The only "deferred tlb flush" we need to worry about, not in the
> picture, is the one following the actual permission removal (the
> wrprotection).

I think you are missing the point of this scenario, which is different from
the second scenario.

In this scenario, change_pte_range(), when called to do userfaultfd’s
*unprotect* operation, did not preserve the write-bit if it was already set.
Instead change_pte_range() *cleared* the write-bit. So upon a logical
permission promotion operation - userfaultfd *unprotect* - you got a
physical permission demotion, turning RW PTEs into RO.
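
To make the consequence concrete, here is a minimal userspace sketch that
replays the interleaving from the trace sequentially. It is only a model
under stated assumptions, not kernel code: model_page, model_state and
model_run() are hypothetical names, the TLB is a single cached entry for
cpu2, and a write that misses the TLB is assumed to simply land on whatever
page the PTE maps once cpu1 has finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { MODEL_PAGE_SIZE = 8 };

/* Hypothetical stand-in for a physical page. */
struct model_page {
	char data[MODEL_PAGE_SIZE];
};

/* Hypothetical state: one PTE plus cpu2's cached translation of it. */
struct model_state {
	struct model_page old_page;
	struct model_page new_page;
	struct model_page *pte_target;      /* page the PTE currently maps */
	bool pte_write;                     /* write bit in the PTE */
	struct model_page *cpu2_tlb_target; /* page cached in cpu2's TLB */
	bool cpu2_tlb_valid;
	bool cpu2_tlb_write;
};

/*
 * Replays the trace in order. Returns true if cpu2's write is visible
 * through the final PTE mapping, false if the write was lost.
 */
bool model_run(bool flush_before_copy)
{
	struct model_state s;
	bool write_deferred = false;

	memset(&s, 0, sizeof(s));
	s.pte_target = &s.old_page;
	s.pte_write = true;

	/* cpu2: writable PTE cached in TLB */
	s.cpu2_tlb_target = s.pte_target;
	s.cpu2_tlb_valid = true;
	s.cpu2_tlb_write = true;

	/* cpu0: "unprotect" clears the write bit; the flush may be deferred */
	s.pte_write = false;
	if (flush_before_copy)
		s.cpu2_tlb_valid = false;

	/* cpu1: page fault on the RO PTE, copies the page */
	s.new_page = s.old_page;

	/* cpu2: write, through the stale entry if it is still cached */
	if (s.cpu2_tlb_valid && s.cpu2_tlb_write)
		s.cpu2_tlb_target->data[0] = 'X';
	else
		write_deferred = true; /* assumption: retried after the fault */

	/* cpu1: installs the copy */
	s.pte_target = &s.new_page;
	s.pte_write = true;
	assert(s.pte_target != s.cpu2_tlb_target);

	if (write_deferred)
		s.pte_target->data[0] = 'X';

	return s.pte_target->data[0] == 'X';
}
```

In this model, model_run(false) reports the write as lost because it lands
on the old page after the copy was taken, while model_run(true) keeps it.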

This problem is fully resolved by this part of the patch:

--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
+			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
+					      pte_write(oldpte);

You can argue that this is not directly related to the deferred TLB flush, as
once this chunk is added, a TLB flush would not be needed at all for
userfaultfd-unprotect. But I consider it a part of the problem, especially
since this is what triggered the userfaultfd self-tests to fail.
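
As a rough illustration of what the hunk changes, the following userspace
sketch models only the write-bit decision. It is not the kernel function:
model_pte and model_change_pte() are hypothetical names, and the way the
preserved bit is folded back into the new PTE is an assumption made for the
sake of the model, since the hunk above does not show that part.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified PTE: only the bits this discussion needs. */
struct model_pte {
	bool present;
	bool write;
};

/*
 * Models the write-bit handling only. newprot_write says whether the new
 * protection includes hardware write permission; patched selects between
 * the old and the new preserve_write expression from the hunk.
 */
struct model_pte model_change_pte(struct model_pte oldpte, bool newprot_write,
				  bool prot_numa, bool uffd_wp_resolve,
				  bool patched)
{
	struct model_pte ptent = oldpte;
	bool preserve_write;

	assert(oldpte.present);

	if (patched)
		preserve_write = (prot_numa || uffd_wp_resolve) && oldpte.write;
	else
		preserve_write = prot_numa && oldpte.write;

	/*
	 * Assumption: applying newprot alone drops the write bit, and
	 * preserve_write is what keeps an already-writable PTE writable.
	 */
	ptent.write = newprot_write || preserve_write;
	return ptent;
}
```

With a present, writable PTE and uffd_wp_resolve set, the unpatched
expression yields an RO PTE (the physical demotion described above), while
the patched one leaves it RW. A PTE that was RO to begin with stays RO in
both cases.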

>> it shows the write that triggers the corruption instead of discussing
>> “windows”, which might be less clear. Running copy_user_page() with stale
>
> I think showing exactly where the race window opens is key to
> understand the issue, but then that's the way I work and feel free to
> think it in any other way that may sound simpler.
>
> I am just worried people think the deferred tlb flush in your v2 trace
> is the one that causes the problem when obviously it's not since it
> follows a permission promotion. Once that is clear, feel free to
> reject my trace.
>
> All I care about is that performance doesn't regress from CPU-speed to
> disk I/O spindle speed, for soft dirty and uffd-wp.

I would feel more comfortable if you provided the patches for uffd-wp. If you
want, I will do it, but I restate that I am not comfortable with this
solution: it seems a bit ad hoc, and I worry that it might miss a scenario
we all overlooked or cause a TLB shootdown storm.

As for soft-dirty, I thought that you said that you do not see a better
(backportable) solution for soft-dirty. Correct me if I am wrong.

Anyhow, I will add your comments regarding the stale TLB window to make the
description clearer.