Re: [RFC PATCH] mm: swap: reuse exclusive folio directly instead of wp page faults

From: David Hildenbrand
Date: Fri May 31 2024 - 08:10:50 EST


On 31.05.24 13:55, Barry Song wrote:
On Fri, May 31, 2024 at 11:08 PM David Hildenbrand <david@xxxxxxxxxx> wrote:

On 31.05.24 12:48, Barry Song wrote:
From: Barry Song <v-songbaohua@xxxxxxxx>

After swapping out, we perform a swap-in operation. If we first read
and then write, we encounter a major fault in do_swap_page for reading,
along with additional minor faults in do_wp_page for writing. However,
the latter appears to be unnecessary and inefficient. Instead, we can
directly reuse in do_swap_page and completely eliminate the need for
do_wp_page.

This patch achieves that optimization specifically for exclusive folios.
The following microbenchmark demonstrates the significant reduction in
minor faults.

#define DATA_SIZE (2UL * 1024 * 1024)
#define PAGE_SIZE (4UL * 1024)

static void *read_write_data(char *addr)
{
char tmp;

for (int i = 0; i < DATA_SIZE; i += PAGE_SIZE) {
tmp = *(volatile char *)(addr + i);
*(volatile char *)(addr + i) = tmp;
}
}

int main(int argc, char **argv)
{
struct rusage ru;

char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
memset(addr, 0x11, DATA_SIZE);

do {
long old_ru_minflt, old_ru_majflt;
long new_ru_minflt, new_ru_majflt;

madvise(addr, DATA_SIZE, MADV_PAGEOUT);

getrusage(RUSAGE_SELF, &ru);
old_ru_minflt = ru.ru_minflt;
old_ru_majflt = ru.ru_majflt;

read_write_data(addr);
getrusage(RUSAGE_SELF, &ru);
new_ru_minflt = ru.ru_minflt;
new_ru_majflt = ru.ru_majflt;

printf("minor faults:%ld major faults:%ld\n",
new_ru_minflt - old_ru_minflt,
new_ru_majflt - old_ru_majflt);
} while(0);

return 0;
}

w/o patch,
/ # ~/a.out
minor faults:512 major faults:512

w/ patch,
/ # ~/a.out
minor faults:0 major faults:512

Minor faults decrease to 0!

Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
---
mm/memory.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eef4e482c0c2..e1d2e339958e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4325,9 +4325,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
if (!folio_test_ksm(folio) &&
(exclusive || folio_ref_count(folio) == 1)) {
- if (vmf->flags & FAULT_FLAG_WRITE) {
- pte = maybe_mkwrite(pte_mkdirty(pte), vma);
- vmf->flags &= ~FAULT_FLAG_WRITE;
+ if (vma->vm_flags & VM_WRITE) {
+ pte = pte_mkwrite(pte_mkdirty(pte), vma);
+ if (vmf->flags & FAULT_FLAG_WRITE)
+ vmf->flags &= ~FAULT_FLAG_WRITE;

This implies, that even on a read fault, you would mark the pte dirty
and it would have to be written back to swap if still in the swap cache
and only read.

That is controversial.

What is less controversial is doing what mprotect() via
change_pte_range()/can_change_pte_writable() would do: mark the PTE
writable but not dirty.

I suggest setting the pte only dirty if FAULT_FLAG_WRITE is set.

Thanks!

I assume you mean something as below?

It raises an important point: uffd-wp must be handled accordingly.


diff --git a/mm/memory.c b/mm/memory.c
index eef4e482c0c2..dbf1ba8ccfd6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4317,6 +4317,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
pte = mk_pte(page, vma->vm_page_prot);

+ if (pte_swp_soft_dirty(vmf->orig_pte))
+ pte = pte_mksoft_dirty(pte);
+ if (pte_swp_uffd_wp(vmf->orig_pte))
+ pte = pte_mkuffd_wp(pte);
/*
* Same logic as in do_wp_page(); however, optimize for pages that are
* certainly not shared either because we just allocated them without
@@ -4325,18 +4329,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
if (!folio_test_ksm(folio) &&
(exclusive || folio_ref_count(folio) == 1)) {
- if (vmf->flags & FAULT_FLAG_WRITE) {
- pte = maybe_mkwrite(pte_mkdirty(pte), vma);
- vmf->flags &= ~FAULT_FLAG_WRITE;
+ if (vma->vm_flags & VM_WRITE) {
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ pte = pte_mkwrite(pte_mkdirty(pte), vma);
+ vmf->flags &= ~FAULT_FLAG_WRITE;
+ } else if ((!vma_soft_dirty_enabled(vma) ||
pte_soft_dirty(pte))
+ && !userfaultfd_pte_wp(vma, pte)) {
+ pte = pte_mkwrite(pte, vma);

Even with FAULT_FLAG_WRITE we must respect uffd-wp and *not* do a pte_mkwrite(pte). So we have to catch and handle that earlier (I could have sworn we handle that somehow).

Note that the existing
pte = pte_mkuffd_wp(pte);

Will fix that up because it does an implicit pte_wrprotect().


So maybe what would work is


if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
!vma_soft_dirty_enabled(vma)) {
pte = pte_mkwrite(pte);

/* Only set the PTE dirty on write fault. */
if (vmf->flags & FAULT_FLAG_WRITE) {
pte = pte_mkdirty(pte);
vmf->flags &= ~FAULT_FLAG_WRITE;
}
}

--
Cheers,

David / dhildenb