Re: [PATCH v16 11/25] mm: pagewalk: Add p4d_entry() and pgd_entry()

From: Thomas Hellström (VMware)
Date: Thu Dec 12 2019 - 06:33:51 EST


On 12/12/19 12:23 PM, Thomas Hellström (VMware) wrote:
On 12/6/19 2:53 PM, Steven Price wrote:
pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were
no users. We're about to add users so reintroduce them, along with
p4d_entry() as we now have 5 levels of tables.

Note that commit a00cc7d9dd93d66a ("mm, x86: add support for
PUD-sized transparent hugepages") already re-added pud_entry() but with
different semantics to the other callbacks. Since there have never
been upstream users of this, revert the semantics back to match the
other callbacks. This means pud_entry() is called for all entries, not
just transparent huge pages.
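To make the consequence of the reverted semantics concrete: since the walker core no longer takes the page-table lock before invoking pud_entry(), a callback that only wants huge PUDs must now take that lock itself and fall through for regular PUDs. A rough, uncompiled sketch (the function name is hypothetical, not part of this patch):

```c
/*
 * Illustrative only: with pud_entry() now called for every non-empty
 * PUD and the walker no longer holding the ptl, a callback that cares
 * only about huge PUDs takes pud_trans_huge_lock() itself.  For a
 * regular PUD the lock helper returns NULL and we fall through, so the
 * pmd/pte callbacks visit the lower levels as usual.
 */
static int example_pud_entry(pud_t *pud, unsigned long addr,
			     unsigned long next, struct mm_walk *walk)
{
	spinlock_t *ptl = pud_trans_huge_lock(pud, walk->vma);

	if (!ptl)
		return 0;	/* not a huge PUD; descend normally */

	/* ... handle the huge PUD under the lock here ... */
	spin_unlock(ptl);
	return 0;
}
```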


Tested-by: Zong Li <zong.li@xxxxxxxxxx>
Signed-off-by: Steven Price <steven.price@xxxxxxx>
---
 include/linux/pagewalk.h | 19 +++++++++++++------
 mm/pagewalk.c            | 27 ++++++++++++++++-----------
 2 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 6ec82e92c87f..06790f23957f 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -8,15 +8,15 @@ struct mm_walk;
 /**
  * mm_walk_ops - callbacks for walk_page_range
- * @pud_entry:		if set, called for each non-empty PUD (2nd-level) entry
- *			this handler should only handle pud_trans_huge() puds.
- *			the pmd_entry or pte_entry callbacks will be used for
- *			regular PUDs.
- * @pmd_entry:		if set, called for each non-empty PMD (3rd-level) entry
+ * @pgd_entry:		if set, called for each non-empty PGD (top-level) entry
+ * @p4d_entry:		if set, called for each non-empty P4D entry
+ * @pud_entry:		if set, called for each non-empty PUD entry
+ * @pmd_entry:		if set, called for each non-empty PMD entry
  *			this handler is required to be able to handle
  *			pmd_trans_huge() pmds. They may simply choose to
  *			split_huge_page() instead of handling it explicitly.
- * @pte_entry:		if set, called for each non-empty PTE (4th-level) entry
+ * @pte_entry:		if set, called for each non-empty PTE (lowest-level)
+ *			entry
  * @pte_hole:		if set, called for each hole at all levels
  * @hugetlb_entry:	if set, called for each hugetlb entry
  * @test_walk:		caller specific callback function to determine whether
@@ -27,8 +27,15 @@ struct mm_walk;
  * @pre_vma:		if set, called before starting walk on a non-null vma.
  * @post_vma:		if set, called after a walk on a non-null vma, provided
  *			that @pre_vma and the vma walk succeeded.
+ *
+ * p?d_entry callbacks are called even if those levels are folded on a
+ * particular architecture/configuration.
  */
 struct mm_walk_ops {
+	int (*pgd_entry)(pgd_t *pgd, unsigned long addr,
+			 unsigned long next, struct mm_walk *walk);
+	int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
+			 unsigned long next, struct mm_walk *walk);
 	int (*pud_entry)(pud_t *pud, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
 	int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index ea0b9e606ad1..c089786e7a7f 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -94,15 +94,9 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 		}

 		if (ops->pud_entry) {
-			spinlock_t *ptl = pud_trans_huge_lock(pud, walk->vma);
-
-			if (ptl) {
-				err = ops->pud_entry(pud, addr, next, walk);
-				spin_unlock(ptl);
-				if (err)
-					break;
-				continue;
-			}
+			err = ops->pud_entry(pud, addr, next, walk);
+			if (err)
+				break;

Actually, there are two current users of pud_entry(): one in hmm.c and, since 5.5-rc1, one in mapping_dirty_helpers.c. The latter is unproblematic and requires no attention, but the one in hmm.c is probably largely untested, and seems to assume it was called outside of the spinlock.

The problem with the current patch is that the hmm pud_entry will also traverse pmds, so that will now be done twice.

/Thomas