Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

From: Andrea Arcangeli
Date: Tue Dec 22 2020 - 22:03:04 EST


On Tue, Dec 22, 2020 at 05:23:39PM -0700, Yu Zhao wrote:
> and 2) people are spearheading multiple efforts to reduce the mmap_lock
> contention, which hopefully would make ufd users suffer less soon.

In my view UFFD is an already deployed working solution that
eliminates the mmap_lock_write contention to allocate and free memory.

We need to add a UFFDIO_POPULATE to use in combination with
UFFD_FEATURE_SIGBUS (UFFDIO_POPULATE just needs to zero out a page or
THP and map it, it'll be indistinguishable to UFFDIO_ZEROPAGE, but it
will solve the last performance bottleneck by avoiding a suprious
wrprotect fault after the allocation).

After that malloc based on uffd should become competitive single
threaded and it won't ever require the mmap_lock_write so allocations
and freeing of memory can continue indefinitely from all threaded in
parallel. There will never be another mmap or munmap stalling all
threads.

This is not why uffd was created, it's just a secondary performance
benefit of uffd, but it's still a relevant benefit in my view.

Every time I hear people with major mmap_lock_write issues I recommend
uffd, but you know, until we add the UFFDIO_POPULATE, it will still
have higher fixed allocation overhead because of the wprotect fault
after UFFDIO_ZEROCOPY. UFFDIO_COPY also would be not as optimal as a
clear_page and currently it's not even THP capable.

In addition you'll get a SIGBUS after an user after free. It's not
like when you have a malloc lib doing MADV_DONTNEED at PAGE_SIZE
granularity to rate limit the costly munmap, and then the app does an
use after free and it reads zero or writes to a newly faulted in page.

The above will not require any special privilege and all allocated
virtual memory remains fully swappable, because SIGBUS mode will never
have to block any kernel initiated faults.

uffd-wp also is totally usable unprivileged by default to replace
various software dirty bits with the info provided in O(1) instead of
O(N), as long as the writes are done in userland also unprivileged by
default without tweaking any sysctl and with zero risk of increasing
reproduciblity of any exploit against unrelated random kernel bugs.

So if we're forced to take the mmap_lock_write it'd be cool if at
least we can avoid it for 1 single pte or hugepmd wrprotection, as it
happens in write_protect_page() KSM.

Thanks,
Andrea