Re: [PATCH v2] mm/uffd: UFFD_FEATURE_WP_UNPOPULATED

From: David Hildenbrand
Date: Thu Mar 02 2023 - 12:39:20 EST


On 02.03.23 18:19, Muhammad Usama Anjum wrote:
On 2/28/23 5:36 AM, Peter Xu wrote:
On Mon, Feb 27, 2023 at 06:00:44PM -0500, Peter Xu wrote:
This is a new feature that controls how uffd-wp handles none ptes. When
it's set, the kernel will handle anonymous memory the same way as file
memory, by allowing the user to wr-protect unpopulated ptes.

File memories handles none ptes consistently by allowing wr-protecting of
none ptes because of the unawareness of page cache being exist or not. For
anonymous it was not as persistent because we used to assume that we don't
need protections on none ptes or known zero pages.

One use case of such a feature bit was VM live snapshot, where if without
wr-protecting empty ptes the snapshot can contain random rubbish in the
holes of the anonymous memory, which can cause misbehave of the guest when
the guest OS assumes the pages should be all zeros.

QEMU worked it around by pre-populate the section with reads to fill in
zero page entries before starting the whole snapshot process [1].

Recently there's another need raised on using userfaultfd wr-protect for
detecting dirty pages (to replace soft-dirty in some cases) [2]. In that
case if without being able to wr-protect none ptes by default, the dirty
info can get lost, since we cannot treat every none pte to be dirty (the
current design is identify a page dirty based on uffd-wp bit being cleared).

In general, we want to be able to wr-protect empty ptes too even for
anonymous.

This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
uffd-wp handling on none ptes being consistent no matter what the memory
type is underneath. It doesn't have any impact on file memories so far
because we already have pte markers taking care of that. So it only
affects anonymous.

The feature bit is by default off, so the old behavior will be maintained.
Sometimes it may be wanted because the wr-protect of none ptes will contain
overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to
anonymous), but also on creating the pgtables to store the pte markers. So
there's potentially less chance of using thp on the first fault for a none
pmd or larger than a pmd.

The major implementation part is teaching the whole kernel to understand
pte markers even for anonymously mapped ranges, meanwhile allowing the
UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
new feature bit is set.

Note that even if the patch subject starts with mm/uffd, there're a few
small refactors to major mm path of handling anonymous page faults. But
they should be straightforward.

So far, add a very light smoke test within the userfaultfd kselftest
pagemap unit test to make sure anon pte markers work.

[1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@xxxxxxxxxxxxx/
[1] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/

Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
---
v1->v2:
- Use pte markers rather than populate zero pages when protect [David]
- Rename WP_ZEROPAGE to WP_UNPOPULATED [David]

Some very initial performance numbers (I only ran in a VM but it should be
similar, unit is "us") below as requested. The measurement is about time
spent when wr-protecting 10G range of empty but mapped memory. It's done
in a VM, assuming we'll get similar results on bare metal.

Four test cases:

- default UFFDIO_WP
- pre-read the memory, then UFFDIO_WP (what QEMU does right now)
- pre-fault using MADV_POPULATE_READ, then default UFFDIO_WP
- UFFDIO_WP with WP_UNPOPULATED

Results:

Test DEFAULT: 2
Test PRE-READ: 3277099 (pre-fault 3253826)
Test MADVISE: 2250361 (pre-fault 2226310)
Test WP-UNPOPULATE: 20850
In your case:
Default < WP-UNPOPULATE < MADVISE < PRE-READ


In my testing on next-20230228 with this patch and my uffd async patch:

Test DEFAULT: 6
Test PRE-READ: 37157 (pre-fault 37006)
Test MADVISE: 4884 (pre-fault 4465)
Test WP-UNPOPULATE: 17794

DEFAULT < MADVISE < WP-UNPOPULATE < PRE-READ

On my setup, MADVISE is performing better than WP-UNPOPULATE consistently.
I'm not sure why I'm getting this discrepancy here. I've liked your results
to be honest where we perform better with WP-UNPOPULATE than MADVISE. What
can be done to get consistent benchmarks over your and my side?

Probably because the current approach from Peter uses uffd-wp markers, and these markers can currently only reside on the PTE level, not on the PMD level yet.

With MADVISE you get a huge zeropage and avoid dealing with PTEs.

--
Thanks,

David / dhildenb