Re: WARN_ON in move_normal_pmd

From: Linus Torvalds
Date: Sat Mar 25 2023 - 13:07:23 EST


On Sat, Mar 25, 2023 at 9:33 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>
> I actually didn't follow what you meant by "mutually PMD-aligned". Could you
> provide some example address numbers to explain?

Sure, let me make this more clear with a couple of concrete examples.

Let's say that we have a range '[old, old+len]' and we want to remap
it to '[new, new+len]'.

Furthermore, we'll say that overlapping is fine, but because we're
always moving pages from "low address to high", we only allow
overlapping when we're moving things down (ie 'new < old').

And yes, I know that the overlapping case cannot actually happen with
mremap() itself. So in practice the overlapping case only happens for
the special "move the stack pages" around at execve() startup, but
let's ignore that for now.

So we'll talk about the generic "move pages around" case, not the more
limited 'mremap()' case.

I'll also simplify the thing to just assume that we have that
CONFIG_HAVE_MOVE_PMD enabled, so I'll ignore some of the full grotty
details.

Ok?

So let's take the simple case first:

(a) everything is PMD aligned, so all of old, new, and len are
multiples of the PMD size.

This is the simple case, because we can just do the whole move by
moving the PMD entries, and we'll never hit any issues. As we move the
PMD entries in move_normal_pmd(), we always remove the entry we are
moving down:

/* Clear the pmd */
pmd = *old_pmd;
pmd_clear(old_pmd);

so as we move up in the destination, we will never hit a populated PMD
entry as we walk the page tables.

So (a) is fine today, everybody is happy, we have no issues. It's
efficient and simple.

The other simple case is

(b) we're *not* PMD-aligned, and everything is done one page at a time.

This is basically the exact same thing as (a), except rather than move
the PMD entry around, we move a PTE entry, and we do the exact same
"clear the entry as we move it" in move_ptes(), except (due to our
locking rules being different), it now looks like this instead:

pte = ptep_get_and_clear(mm, old_addr, old_pte);

but otherwise it's all the same as (a), just one level lower.

But then we have case (c): the "mutually aligned" case:

> AFAIK, only 2MB aligned memory addresses can be directly addressed by a PMD.
> Otherwise how will you index the bytes in the 2MB page? You need bits in the
> address for that.

The thing is, the problematic case is when 'old' and 'new' are not
initially themselves PMD-aligned (so they have lower bits set), but
they are *mutually* aligned, so they have the *same* lower bits set.

So in this case (c), we start off with case (b), but as we walk the
address one page at a time, at some point we suddenly hit case (a) in
the middle.

To make this really concrete, lets say that we have

old = 0x1fff000
new = 0x0fff000
len = 0x201000

and we have PMD_SIZE being 0x200000.

Notice how 'old' and 'new' are *not* PMD-aligned, but they are
*mutually* aligned in the PMD size (ie "old & (PMD_SIZE-1)" and "new &
(PMD_SIZE-1)" are the same).

Look at what happens: we start with the unaligned case, and we move
one single page, from address 0x1fff000 to 0x0fff000.

But what happens after we moved that page? We'll update old/new/len,
and now, we'll have

old = 0x2000000
new = 0x1000000
len = 0x200000

and now evertthing *is* PMD-aligned, and we just move a single PMD
entry from 0x2000000 to 0x1000000. And then we're done.

And that's fine, and won't cause any problems, because the area wasn't
overlapping at a PMD level, so we're all good.

But it it *was* overlapping, and we started out with

old = 0x1fff000
new = 1dff000
len = 0x201000

instead, we start out with the exact same thing, but after moving one
page, we'll have

old = 0x2000000
new = 0x1e00000
len = 0x200000

and now when we try to move the PMD entry from 0x2000000 to 0x1e00000,
we'll hit that *old* (and empty) PMD entry that used to contain that
one single page that we moved earlier.

And notice how for this all to happen, the old/new addresses have to
start out mutually aligned. So we can see the dangerous case up-front:
even if old and new aren't PMD-aligned to start with, you can see the
low bits are the same:

(old ^ new) & (PMD_SIZE-1) == 0

because as we move pages around, we'll always be updating old/new
together by the same amount.

So what I'm saying is that *if* we start out with that situation, and
we have that

old = 0x1fff000
new = 1dff000
len = 0x201000

we could easily decode "let's just move the whole PMD", and expand the
move to be

old = 0x1e00000
new = 0x1c00000
len = 0x400000

instead. And then instead of moving PTE's around at first, we'd move
PMD's around *all* the time, and turn this into that "simple case
(a)".

NOTE! For this to work, there must be no mapping right below 'old' or
'new', of course. But during the execve() startup, that should be
trivially true.

See what I'm saying?

Linus