Re: WARN_ON in move_normal_pmd

From: Joel Fernandes
Date: Sat Mar 25 2023 - 22:27:08 EST


On Sat, Mar 25, 2023 at 10:06:59AM -0700, Linus Torvalds wrote:
> On Sat, Mar 25, 2023 at 9:33 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> >
> > I actually didn't follow what you meant by "mutually PMD-aligned". Could you
> > provide some example address numbers to explain?
>
> Sure, let me make this more clear with a couple of concrete examples.
>
> Let's say that we have a range '[old, old+len]' and we want to remap
> it to '[new, new+len]'.
>
> Furthermore, we'll say that overlapping is fine, but because we're
> always moving pages from "low address to high", we only allow
> overlapping when we're moving things down (ie 'new < old').
>
> And yes, I know that the overlapping case cannot actually happen with
> mremap() itself. So in practice the overlapping case only happens for
> the special "move the stack pages" around at execve() startup, but
> let's ignore that for now.
>
> So we'll talk about the generic "move pages around" case, not the more
> limited 'mremap()' case.
>
> I'll also simplify the thing to just assume that we have that
> CONFIG_HAVE_MOVE_PMD enabled, so I'll ignore some of the full grotty
> details.
>
> Ok?
[...]
>
> we could easily decode "let's just move the whole PMD", and expand the
> move to be
>
> old = 0x1e00000
> new = 0x1c00000
> len = 0x400000
> instead. And then instead of moving PTE's around at first, we'd move
> PMD's around *all* the time, and turn this into that "simple case
> (a)".

Right, I totally get what you mean. You want to move more than the 4k pages
in the beginning of the mapping. In fact the whole PMD, which extends further
below the destination to capture the full PMD that the first 4k pages are
located in. With that you get to just move PMDs purely all the way.

I think that is a great idea.

> NOTE! For this to work, there must be no mapping right below 'old' or
> 'new', of course. But during the execve() startup, that should be
> trivially true.

Exactly it wont work if there is something below old or new.

So for that very reason, we still have to handle the bad case where the
source PMD was not deleted right? Because if there is something below new,
you'll need to copy 1 PTE at a time till you hit the 2MB boundary, because
you can't mess with that source PMD, it is in use to satisfy mappings below
new. Then you'll eventually hit the warning we are discussing. I guess even
if one can assure that there is no mapping below new for the execve() case,
it still cannot be guaranteed for the mremap() case I think.

But I agree, if there is no mapping below old/new, then we can just do this
as an optimization. I think all that is needed to do is to check whether
there are any VMAs at those locations, but correct me if I'm wrong as I'm not
an mm expert.

> See what I'm saying?

Yep.

And as you pointed out in the mremap example, this issue can also show up
with non-overlapping ranges if I'm not mistaken.

I get your idea. Allow me to digest all this a bit more, and since it is not
urgent and this stuff is going to take some careful work with proper test
cases etc, let me take this up and work on it. But your idea is loud and
clear. I am also working on sending you that RCU PR and working hard to not
screw that up so it is a bit busy :-P. And thank you again for the great
idea and discussion! Looking forward to working on this.

thanks,

- Joel