[PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present

From: Thomas HellstrÃm (VMware)
Date: Tue Oct 08 2019 - 05:15:53 EST


From: Thomas Hellstrom <thellstrom@xxxxxxxxxx>

The pagewalk code was unconditionally splitting transhuge pmds when a
pte_entry was present. However ideally we'd want to handle transhuge pmds
in the pmd_entry function and ptes in pte_entry function. So don't split
huge pmds when there is a pmd_entry function present, but let the callback
take care of it if necessary.

In order to make sure a virtual address range is handled by one and only
one callback, and since pmd entries may be unstable, we introduce a
pmd_entry return code that tells the walk code to continue processing this
pmd entry rather than to move on. Since caller-defined positive return
codes (up to 2) are used by current callers, use a high value that allows a
large range of positive caller-defined return codes for future users.

Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
Cc: Will Deacon <will.deacon@xxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxxx>
Cc: Minchan Kim <minchan@xxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Huang Ying <ying.huang@xxxxxxxxx>
Cc: JÃrÃme Glisse <jglisse@xxxxxxxxxx>
Cc: Kirill A. Shutemov <kirill@xxxxxxxxxxxxx>
Suggested-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Thomas Hellstrom <thellstrom@xxxxxxxxxx>
---
include/linux/pagewalk.h | 8 ++++++++
mm/pagewalk.c | 28 +++++++++++++++++++++-------
2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index bddd9759bab9..c4a013eb445d 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -4,6 +4,11 @@

#include <linux/mm.h>

+/* Highest positive pmd_entry caller-specific return value */
+#define PAGE_WALK_CALLER_MAX (INT_MAX / 2)
+/* The handler did not handle the entry. Fall back to the next level */
+#define PAGE_WALK_FALLBACK (PAGE_WALK_CALLER_MAX + 1)
+
struct mm_walk;

/**
@@ -16,6 +21,9 @@ struct mm_walk;
* this handler is required to be able to handle
* pmd_trans_huge() pmds. They may simply choose to
* split_huge_page() instead of handling it explicitly.
+ * If the handler did not handle the PMD, or split the
+ * PMD and wants it handled by the PTE handler, it
+ * should return PAGE_WALK_FALLBACK.
* @pte_entry: if set, called for each non-empty PTE (4th-level) entry
* @pte_hole: if set, called for each hole at all levels
* @hugetlb_entry: if set, called for each hugetlb entry
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 83c0b78363b4..f844c2a2aa60 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -50,10 +50,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
* This implies that each ->pmd_entry() handler
* needs to know about pmd_trans_huge() pmds
*/
- if (ops->pmd_entry)
+ if (ops->pmd_entry) {
err = ops->pmd_entry(pmd, addr, next, walk);
- if (err)
- break;
+ if (!err)
+ continue;
+ else if (err <= PAGE_WALK_CALLER_MAX)
+ break;
+ WARN_ON(err != PAGE_WALK_FALLBACK);
+ err = 0;
+ if (pmd_trans_unstable(pmd))
+ goto again;
+ /* Fall through */
+ }

/*
* Check this here so we only break down trans_huge
@@ -61,8 +69,8 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
*/
if (!ops->pte_entry)
continue;
-
- split_huge_pmd(walk->vma, pmd, addr);
+ if (!ops->pmd_entry)
+ split_huge_pmd(walk->vma, pmd, addr);
if (pmd_trans_unstable(pmd))
goto again;
err = walk_pte_range(pmd, addr, next, walk);
@@ -281,11 +289,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
*
* - 0 : succeeded to handle the current entry, and if you don't reach the
* end address yet, continue to walk.
- * - >0 : succeeded to handle the current entry, and return to the caller
- * with caller specific value.
+ * - >0, and <= PAGE_WALK_CALLER_MAX : succeeded to handle the current entry,
+ * and return to the caller with caller specific value.
* - <0 : failed to handle the current entry, and return to the caller
* with error code.
*
+ * For pmd_entry(), a value <= PAGE_WALK_CALLER_MAX indicates that the entry
+ * was handled by the callback. PAGE_WALK_FALLBACK indicates that the entry
+ * could not be handled by the callback and should be re-checked. If the
+ * callback needs the entry to be handled by the next level, it should
+ * split the entry and then return PAGE_WALK_FALLBACK.
+ *
* Before starting to walk page table, some callers want to check whether
* they really want to walk over the current vma, typically by checking
* its vm_flags. walk_page_test() and @ops->test_walk() are used for this
--
2.21.0