Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present

From: Linus Torvalds
Date: Wed Oct 09 2019 - 20:19:28 EST


On Wed, Oct 9, 2019 at 4:51 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> (a) right now nobody wants the "skip" behavior. You think you'll
> eventually want it
>
> (b) right now, the "return positive value" is actually a horribly
> ugly pointless hack, which could be made to be an error value and
> cleaned up in the process
>
> (c) to me that really argues that we should just make the rule be
>
> - negative error means break out with error
>
> - 0 means continue down the next level
>
> - positive could be trivially be made to just mean "ok, I did it,
> you can just continue".
>
> and I think that would make a lot of sense.

So here's an ENTIRELY untested patch, but the return value of
pmd_entry() now makes conceptual sense to me. The whole "I hit an
error, I did nothing, I already did it myself" to me is the intuitive
meaning of {neg,0,pos} handling.

I think we probably should do this same thing for the upper levels too
to be consistent, but I think this at least makes sense, and is
simple, and avoids any hacky PAGE_WALK_CALLER_MAX magic.

I also wonder if some caller might want to get a count of "how many X
handled", and we'd just sum up all the positive return values as we're
traversing things, but that falls under "nobody seems to want it right
now", so I'm not adding extra code for something that might not be
useful.

And it is possible that I missed some other pmd_entry() callback that
returns a positive value. I did check them all, but mistakes happen
and I might have missed some case...

Again: entirely and utterly untested. It compiles. That's all I'm
going to guarantee, and even that might be a fluke.

Linus
mm/mempolicy.c | 10 +++++-----
mm/pagewalk.c | 11 ++++++++---
2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ae967bcf954..f8c99591592b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -482,7 +482,7 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
*
* queue_pages_pte_range() has three possible return values:
* 0 - pages are placed on the right node or queued successfully.
- * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
+ * -EBUSY - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
* specified.
* -EIO - only MPOL_MF_STRICT was specified and an existing page was already
* on a node that does not follow the policy.
@@ -669,7 +669,7 @@ static const struct mm_walk_ops queue_pages_walk_ops = {
* passed via @private.
*
* queue_pages_range() has three possible return values:
- * 1 - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were
+ * -EBUSY - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were
* specified.
* 0 - queue pages successfully or no misplaced page.
* -EIO - there is misplaced page and only MPOL_MF_STRICT was specified.
@@ -1285,8 +1285,8 @@ static long do_mbind(unsigned long start, unsigned long len,
ret = queue_pages_range(mm, start, end, nmask,
flags | MPOL_MF_INVERT, &pagelist);

- if (ret < 0) {
- err = -EIO;
+ if (ret < 0 && ret != -EBUSY) {
+ err = ret;
goto up_out;
}

@@ -1303,7 +1303,7 @@ static long do_mbind(unsigned long start, unsigned long len,
putback_movable_pages(&pagelist);
}

- if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
+ if ((ret < 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
err = -EIO;
} else
putback_movable_pages(&pagelist);
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index d48c2a986ea3..eb9d292588a2 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -49,10 +49,15 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
* This implies that each ->pmd_entry() handler
* needs to know about pmd_trans_huge() pmds
*/
- if (ops->pmd_entry)
+ if (ops->pmd_entry) {
err = ops->pmd_entry(pmd, addr, next, walk);
- if (err)
- break;
+ if (err < 0)
+ break;
+ if (err > 0) {
+ err = 0;
+ continue;
+ }
+ }

/*
* Check this here so we only break down trans_huge