Re: [PATCH 3/5] mm: Rework return value for copy_one_pte()

From: Linus Torvalds
Date: Wed Sep 23 2020 - 13:22:50 EST


On Mon, Sep 21, 2020 at 2:18 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> There's one special path for copy_one_pte() with swap entries, in which
> add_swap_count_continuation(GFP_ATOMIC) might fail. In that case we'll return
> the swp_entry_t so that the caller will release the locks and redo the same
> thing with GFP_KERNEL.
>
> It's confusing when copy_one_pte() must return a swp_entry_t (even if all the
> ptes are non-swap entries). More importantly, we face other requirement to
> extend this "we need to do something else, but without the locks held" case.
>
> Rework the return value into something easier to understand, as defined in enum
> copy_mm_ret. We'll pass the swp_entry_t back using the newly introduced union
> copy_mm_data parameter.

Ok, I'm reading this series, and I do hate this.

And I think it's unnecessary.

There's a very simple way to avoid this all: split out the
"!pte_present(pte)" case from the function entirely.

That actually makes the code much more legible: that non-present case
is very different, and it's also unlikely() and causes deeper
indentation etc.

Because it's unlikely, it probably also shouldn't be inline.

That unlikely case is also why when then have that special
"out_set_pte" label, which should just go away and be copied into the
(now uninlined) function.

Once that re-organization has been done, the second step is to then
just move the "pte_present()" check into the caller, and suddenly all
the ugly return value games will go entirely away.

I'm attaching the two patches that do this here, but I do want to note
how that first patch is much more legible with "--ignore-all-space",
and then you really see that the diff is a _pure_ code movement thing.
Otherwise it looks like it's doing a big change.

Comments?

NOTE! The intent here is that now we can easily add new argument (a
pre-allocated page or NULL) and a return value to
"copy_present_page()": it can return "I needed a temporary page but
you hadn't allocated one yet" or "I used up the temporary page you
gave me" or "all good, keep the temporary page around for the future".

But these two patches are very intentionally meant to be just "this
clearly changes NO semantics at all".

Linus
From df3a57d1f6072d07978bafa7dbd9904cdf8f3e13 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Wed, 23 Sep 2020 09:56:59 -0700
Subject: [PATCH 1/2] mm: split out the non-present case from copy_one_pte()

This is a purely mechanical split of the copy_one_pte() function. It's
not immediately obvious when looking at the diff because of the
indentation change, but the way to see what is going on in this commit
is to use the "-w" flag to not show pure whitespace changes, and you see
how the first part of copy_one_pte() is simply lifted out into a
separate function.

And since the non-present case is marked unlikely, don't make the new
function be inlined. Not that gcc really seems to care, since it looks
like it will inline it anyway due to the whole "single callsite for
static function" logic. In fact, code generation with the function
split is almost identical to before. But not marking it inline is the
right thing to do.

This is pure prep-work and cleanup for subsequent changes.

Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
---
mm/memory.c | 152 ++++++++++++++++++++++++++++------------------------
1 file changed, 82 insertions(+), 70 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 469af373ae76..31a3ab7d9aa3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -695,6 +695,84 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
* covered by this vma.
*/

+static unsigned long
+copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
+ unsigned long addr, int *rss)
+{
+ unsigned long vm_flags = vma->vm_flags;
+ pte_t pte = *src_pte;
+ struct page *page;
+ swp_entry_t entry = pte_to_swp_entry(pte);
+
+ if (likely(!non_swap_entry(entry))) {
+ if (swap_duplicate(entry) < 0)
+ return entry.val;
+
+ /* make sure dst_mm is on swapoff's mmlist. */
+ if (unlikely(list_empty(&dst_mm->mmlist))) {
+ spin_lock(&mmlist_lock);
+ if (list_empty(&dst_mm->mmlist))
+ list_add(&dst_mm->mmlist,
+ &src_mm->mmlist);
+ spin_unlock(&mmlist_lock);
+ }
+ rss[MM_SWAPENTS]++;
+ } else if (is_migration_entry(entry)) {
+ page = migration_entry_to_page(entry);
+
+ rss[mm_counter(page)]++;
+
+ if (is_write_migration_entry(entry) &&
+ is_cow_mapping(vm_flags)) {
+ /*
+ * COW mappings require pages in both
+ * parent and child to be set to read.
+ */
+ make_migration_entry_read(&entry);
+ pte = swp_entry_to_pte(entry);
+ if (pte_swp_soft_dirty(*src_pte))
+ pte = pte_swp_mksoft_dirty(pte);
+ if (pte_swp_uffd_wp(*src_pte))
+ pte = pte_swp_mkuffd_wp(pte);
+ set_pte_at(src_mm, addr, src_pte, pte);
+ }
+ } else if (is_device_private_entry(entry)) {
+ page = device_private_entry_to_page(entry);
+
+ /*
+ * Update rss count even for unaddressable pages, as
+ * they should treated just like normal pages in this
+ * respect.
+ *
+ * We will likely want to have some new rss counters
+ * for unaddressable pages, at some point. But for now
+ * keep things as they are.
+ */
+ get_page(page);
+ rss[mm_counter(page)]++;
+ page_dup_rmap(page, false);
+
+ /*
+ * We do not preserve soft-dirty information, because so
+ * far, checkpoint/restore is the only feature that
+ * requires that. And checkpoint/restore does not work
+ * when a device driver is involved (you cannot easily
+ * save and restore device driver state).
+ */
+ if (is_write_device_private_entry(entry) &&
+ is_cow_mapping(vm_flags)) {
+ make_device_private_entry_read(&entry);
+ pte = swp_entry_to_pte(entry);
+ if (pte_swp_uffd_wp(*src_pte))
+ pte = pte_swp_mkuffd_wp(pte);
+ set_pte_at(src_mm, addr, src_pte, pte);
+ }
+ }
+ set_pte_at(dst_mm, addr, dst_pte, pte);
+ return 0;
+}
+
static inline unsigned long
copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
@@ -705,75 +783,10 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
struct page *page;

/* pte contains position in swap or file, so copy. */
- if (unlikely(!pte_present(pte))) {
- swp_entry_t entry = pte_to_swp_entry(pte);
-
- if (likely(!non_swap_entry(entry))) {
- if (swap_duplicate(entry) < 0)
- return entry.val;
-
- /* make sure dst_mm is on swapoff's mmlist. */
- if (unlikely(list_empty(&dst_mm->mmlist))) {
- spin_lock(&mmlist_lock);
- if (list_empty(&dst_mm->mmlist))
- list_add(&dst_mm->mmlist,
- &src_mm->mmlist);
- spin_unlock(&mmlist_lock);
- }
- rss[MM_SWAPENTS]++;
- } else if (is_migration_entry(entry)) {
- page = migration_entry_to_page(entry);
-
- rss[mm_counter(page)]++;
-
- if (is_write_migration_entry(entry) &&
- is_cow_mapping(vm_flags)) {
- /*
- * COW mappings require pages in both
- * parent and child to be set to read.
- */
- make_migration_entry_read(&entry);
- pte = swp_entry_to_pte(entry);
- if (pte_swp_soft_dirty(*src_pte))
- pte = pte_swp_mksoft_dirty(pte);
- if (pte_swp_uffd_wp(*src_pte))
- pte = pte_swp_mkuffd_wp(pte);
- set_pte_at(src_mm, addr, src_pte, pte);
- }
- } else if (is_device_private_entry(entry)) {
- page = device_private_entry_to_page(entry);
-
- /*
- * Update rss count even for unaddressable pages, as
- * they should treated just like normal pages in this
- * respect.
- *
- * We will likely want to have some new rss counters
- * for unaddressable pages, at some point. But for now
- * keep things as they are.
- */
- get_page(page);
- rss[mm_counter(page)]++;
- page_dup_rmap(page, false);
-
- /*
- * We do not preserve soft-dirty information, because so
- * far, checkpoint/restore is the only feature that
- * requires that. And checkpoint/restore does not work
- * when a device driver is involved (you cannot easily
- * save and restore device driver state).
- */
- if (is_write_device_private_entry(entry) &&
- is_cow_mapping(vm_flags)) {
- make_device_private_entry_read(&entry);
- pte = swp_entry_to_pte(entry);
- if (pte_swp_uffd_wp(*src_pte))
- pte = pte_swp_mkuffd_wp(pte);
- set_pte_at(src_mm, addr, src_pte, pte);
- }
- }
- goto out_set_pte;
- }
+ if (unlikely(!pte_present(pte)))
+ return copy_nonpresent_pte(dst_mm, src_mm,
+ dst_pte, src_pte, vma,
+ addr, rss);

/*
* If it's a COW mapping, write protect it both
@@ -807,7 +820,6 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
rss[mm_counter(page)]++;
}

-out_set_pte:
set_pte_at(dst_mm, addr, dst_pte, pte);
return 0;
}
--
2.28.0.218.gc12ef3d349

From 79a1971c5f14ea3a6e2b0c4caf73a1760db7cab8 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Wed, 23 Sep 2020 10:04:16 -0700
Subject: [PATCH 2/2] mm: move the copy_one_pte() pte_present check into the
caller

This completes the split of the non-present and present pte cases by
moving the check for the source pte being present into the single
caller, which also means that we clearly separate out the very different
return value case for a non-present pte.

The present pte case currently always succeeds.

This is a pure code re-organization with no semantic change: the intent
is to make it much easier to add a new return case to the present pte
case for when we do early COW at page table copy time.

This was split out from the previous commit simply to make it easy to
visually see that there were no semantic changes from this code
re-organization.

Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
---
mm/memory.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 31a3ab7d9aa3..e315b1f1ef08 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -773,8 +773,8 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
return 0;
}

-static inline unsigned long
-copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+static inline void
+copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
unsigned long addr, int *rss)
{
@@ -782,12 +782,6 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t pte = *src_pte;
struct page *page;

- /* pte contains position in swap or file, so copy. */
- if (unlikely(!pte_present(pte)))
- return copy_nonpresent_pte(dst_mm, src_mm,
- dst_pte, src_pte, vma,
- addr, rss);
-
/*
* If it's a COW mapping, write protect it both
* in the parent and the child
@@ -821,7 +815,6 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
}

set_pte_at(dst_mm, addr, dst_pte, pte);
- return 0;
}

static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -863,10 +856,17 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
progress++;
continue;
}
- entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
+ if (unlikely(!pte_present(*src_pte))) {
+ entry.val = copy_nonpresent_pte(dst_mm, src_mm,
+ dst_pte, src_pte,
vma, addr, rss);
- if (entry.val)
- break;
+ if (entry.val)
+ break;
+ progress += 8;
+ continue;
+ }
+ copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
+ vma, addr, rss);
progress += 8;
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);

--
2.28.0.218.gc12ef3d349