Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

From: Nadav Amit
Date: Sat Dec 19 2020 - 16:37:05 EST


[ cc’ing some more people who have experience with similar problems ]

> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
>> Analyzing this problem indicates that there is a real bug since
>> mmap_lock is only taken for read in mwriteprotect_range(). This might
>
> Never having to take the mmap_sem for writing, and in turn never
> blocking, in order to modify the pagetables is quite an important
> feature in uffd that justifies uffd instead of mprotect. It's not the
> most important reason to use uffd, but it'd be nice if that guarantee
> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> other pgtable manipulations.
>
>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
>>
>> cpu0                                cpu1
>> ----                                ----
>> userfaultfd_writeprotect()
>> [ write-protecting ]
>> mwriteprotect_range()
>> mmap_read_lock()
>> change_protection()
>> change_protection_range()
>>     ...
>>     change_pte_range()
>>     [ defer TLB flushes]
>>                                     userfaultfd_writeprotect()
>>                                      mmap_read_lock()
>>                                      change_protection()
>>                                      [ write-unprotect ]
>>                                      ...
>>                                      [ unprotect PTE logically ]
>>                                     ...
>>                                     [ page-fault]
>>                                     ...
>>                                     wp_page_copy()
>>                                     [ set new writable page in PTE]
>
> Can't we check mm_tlb_flush_pending(vma->vm_mm) if MM_CP_UFFD_WP_ALL
> is set and do an explicit (potentially spurious) tlb flush before
> write-unprotect?

There is a concrete scenario that I actually encountered and then there is a
general problem.

In general, the kernel code assumes that PTEs that are read from the
page-tables are coherent across all the TLBs, excluding permission promotion
(i.e., the PTE may have higher permissions in the page-tables than those
that are cached in the TLBs).
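
To make the assumption concrete, here is a rough sketch (not actual kernel
code; the function name is made up) of a write-fault path that trusts the
page-tables in exactly this way:

/*
 * If the PTE is already writable in the page-tables, the fault is
 * presumed spurious (a TLB entry with *fewer* permissions), so it is
 * enough to resync the TLB.  This is only safe as long as no TLB can
 * cache *more* permissions than the page-tables currently hold.
 */
static vm_fault_t example_write_fault(struct vm_area_struct *vma,
				      unsigned long addr, pte_t *ptep)
{
	if (pte_write(*ptep)) {
		/* Presumed stale TLB entry: flush and retry the access. */
		flush_tlb_page(vma, addr);
		return 0;
	}

	/* Otherwise take the regular copy-on-write slow path. */
	return VM_FAULT_RETRY;
}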

We therefore need both (a) to protect change_protection_range() from the
changes of others who might defer TLB flushes without taking mmap_sem for
write (e.g., try_to_unmap_one()), and (b) to protect others (e.g.,
page-fault handlers) from the concurrent changes of change_protection().
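
For context, the window that (a) and (b) have to deal with comes from the
way change_protection_range() brackets its deferred flush; a simplified
sketch, with the actual PTE walk elided:

	inc_tlb_flush_pending(mm);	/* deferred-flush window opens */

	/* ... walk the range and update PTEs; no TLB flush yet ... */

	/* Only flush the TLB if we actually modified any entries: */
	if (pages)
		flush_tlb_range(vma, start, end);
	dec_tlb_flush_pending(mm);	/* window closes */

Between inc_tlb_flush_pending() and the flush, the page-tables and the TLBs
disagree, and anyone who reads the PTEs in that window has to cope with it.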

We have already encountered several similar bugs. Debugging such issues is
time consuming, and their impact is substantial (memory corruption,
security). So I think we should stick to general solutions only.

So perhaps the approach of your proposed solution is feasible, but it would
have to be applied all over the place: we would need to add a check for
mm_tlb_flush_pending() and conditionally flush the TLB in every case in
which PTEs are read and there might be an assumption that the access
permissions reflect what the TLBs hold. This includes page-fault handlers,
but also the NUMA migration code in change_protection(), the softdirty
cleanup in clear_refs_write(), and maybe others.
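
For illustration, each such place would need something along the lines of
the following (hypothetical helper; the name is made up):

/*
 * If another thread has PTE changes with a deferred TLB flush in
 * flight, flush (possibly spuriously) before relying on a PTE value
 * that was just read from the page-tables.
 */
static void example_sync_with_deferred_flushes(struct vm_area_struct *vma,
					       unsigned long addr)
{
	if (mm_tlb_flush_pending(vma->vm_mm))
		flush_tlb_page(vma, addr);
}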

[ I have in mind another solution, such as keeping in each page-table a
“table-generation”, which is the mm-generation at the time of the change,
and only flushing if “table-generation” == “mm-generation”; but it requires
some thought on how to avoid adding new memory barriers. ]
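
Very roughly, and with all names hypothetical and the memory-ordering
details deliberately ignored, the idea would look something like:

struct example_mm_gen {
	atomic64_t	mm_gen;		/* bumped on every TLB flush */
};

struct example_table_gen {
	u64		table_gen;	/* mm_gen at the last PTE change */
};

/*
 * A flush is only needed if the page-table was modified during the
 * current (not yet flushed) generation; if mm_gen has advanced since
 * the change, a flush already covered it.
 */
static bool example_needs_flush(struct example_mm_gen *mm,
				struct example_table_gen *pt)
{
	return pt->table_gen == atomic64_read(&mm->mm_gen);
}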

IOW: I think the change that you suggest is insufficient, and a proper
solution is too intrusive for “stable”.

As for performance, I can add another patch later to remove the TLB flush
that change_protection_range() performs unnecessarily when it only promotes
permissions. I know that your concern is about the “protect” case, but I
cannot think of a good immediate solution that avoids taking mmap_lock for
write.
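
The idea would be something along these lines (hypothetical helper; the
exact conditions still need to be worked out):

/*
 * Decide whether changing a PTE from 'oldpte' to 'newpte' can skip the
 * TLB flush.  Pure write-permission promotion does not need a flush,
 * since the TLBs may only cache equal or lesser permissions; any
 * demotion or other change still does.
 */
static bool example_flush_can_be_skipped(pte_t oldpte, pte_t newpte)
{
	if (pte_pfn(oldpte) != pte_pfn(newpte))
		return false;

	/* Only the write bit is gained; everything else is unchanged. */
	if (!pte_write(oldpte) && pte_write(newpte) &&
	    pte_same(pte_mkwrite(oldpte), newpte))
		return true;

	return false;
}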

Thoughts?