Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

From: Jason Gunthorpe
Date: Thu Sep 17 2020 - 16:08:05 EST


On Thu, Sep 17, 2020 at 11:11:06AM -0700, Linus Torvalds wrote:
> (a) if the pinner is going to change the page, it will have to get
> the pin with FOLL_WRITE in addition to FOLL_PIN
>
> (b) that means it will do the COW and mark the page writable and dirty

Yep

> (c) the whole _point_ of the FOLL_PIN is that subsequent operations
> shouldn't make it non-writable any more (ie it can't be unmapped, and
> we should synchronize on fork etc)

It is the ideal, but FOLL_PIN has been troubled for a long time:
https://lwn.net/Articles/753027/

ie writeprotect is known to happen due to writeback. I had understood
that fork() will also cause write protect. Eg

copy_process()
copy_mm()
dup_mm()
dup_mmap()
copy_page_range()
[..]
copy_one_pte()

Gets to:

if (is_cow_mapping(vm_flags) && pte_write(pte)) {
ptep_set_wrprotect(src_mm, addr, src_pte);
pte = pte_wrprotect(pte);
}

Which blindly write protects a FOLL_PIN page. Here src_pte will be
under a pin.

I *think* the failing test is basically:

1) pin_user_pages(mem, FOLL_FORCE | FOLL_WRITE)
2) pid = fork()
3) child: does a bit, then exec
4) parent: waitpid(pid)
5) *mem = 0

Here #5 is the WP triggered COW. We know a WP triggered COW is
happening from the bisect and success result with MADV_DONTFORK.

#2 triggers the dup_mmap() and the ptep_set_wrprotect() (From
inspection, at least)

This "Trial do_wp_page() simplification" patch means that when #5 goes
to do COW it gets a copy instead of a re-use because the
reuse_swap_page() was aborting the copy before.

So, to your point, yes ideally FOLL_PIN would never write-protect
pages!

Looking for awhile, this now looks reasonable and
doable. page_maybe_dma_pinned() was created for exactly this kind of
case.

I've attached a dumb sketch for the pte level (surely wrong! I have
never looked at this part of the mm before!) at the end of this
message.

Peter, do you know this better? Does this inspire you to make a patch? :)

Would really love to see this. We have such a huge expensive mess with
MADV_DONTFORK, this would eliminate all that overhead.

Thanks,
Jason

diff --git a/mm/memory.c b/mm/memory.c
index 469af373ae76e1..6bc19a43da1391 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -689,6 +689,21 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
}
#endif

+static int duplicate_page(pte_t *newpte, struct vm_area_struct *vma,
+ unsigned long address, struct page *page)
+{
+ struct page *new_page;
+
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ if (!new_page)
+ return -ENOMEM;
+ copy_user_highpage(new_page, page, address, vma);
+
+ /* FIXME: surely more than this */
+ *newpte = mk_pte(new_page, vma->vm_page_prot);
+ return 0;
+}
+
/*
* copy one vm_area from one task to the other. Assumes the page tables
* already present in the new task to be cleared in the whole range
@@ -703,6 +718,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
unsigned long vm_flags = vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
+ bool do_src_wp;

/* pte contains position in swap or file, so copy. */
if (unlikely(!pte_present(pte))) {
@@ -775,15 +791,6 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_set_pte;
}

- /*
- * If it's a COW mapping, write protect it both
- * in the parent and the child
- */
- if (is_cow_mapping(vm_flags) && pte_write(pte)) {
- ptep_set_wrprotect(src_mm, addr, src_pte);
- pte = pte_wrprotect(pte);
- }
-
/*
* If it's a shared mapping, mark it clean in
* the child
@@ -800,11 +807,34 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (!(vm_flags & VM_UFFD_WP))
pte = pte_clear_uffd_wp(pte);

+ do_src_wp = is_cow_mapping(vm_flags) && pte_write(pte);
page = vm_normal_page(vma, addr, pte);
if (page) {
get_page(page);
page_dup_rmap(page, false);
rss[mm_counter(page)]++;
+
+ /*
+ * If a page is DMA pinned we never want to write protect it,
+ * copy it now.
+ */
+ if (do_src_wp && page_maybe_dma_pinned(page)) {
+ do_src_wp = false;
+ ret = duplicate_page(&pte, vma, addr, page);
+ if (ret)
+ /* FIXME: need to restructure a bit to handle this */
+ return ret;
+ }
+ }
+
+ /*
+ * If it's a COW mapping, write protect it both
+ * in the parent and the child
+ * FIXME check carefully this is new order is OK
+ */
+ if (do_src_wp) {
+ ptep_set_wrprotect(src_mm, addr, src_pte);
+ pte = pte_wrprotect(pte);
}

out_set_pte: