Re: WARN_ON in move_normal_pmd

From: Linus Torvalds
Date: Sat Mar 25 2023 - 13:27:12 EST


On Sat, Mar 25, 2023 at 10:06 AM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> So what I'm saying is that *if* we start out with that situation, and
> we have that
>
> old = 0x1fff000
> new = 0x1dff000
> len = 0x201000
>
> we could easily decide "let's just move the whole PMD", and expand the
> move to be
>
> old = 0x1e00000
> new = 0x1c00000
> len = 0x400000
>
> instead. And then instead of moving PTE's around at first, we'd move
> PMD's around *all* the time, and turn this into that "simple case
> (a)".
>
> NOTE! For this to work, there must be no mapping right below 'old' or
> 'new', of course. But during the execve() startup, that should be
> trivially true.
>
> See what I'm saying?

Also note that my comments about "this can be tested with mremap()"
are because the above optimization works and is valid even when old
and new are not originally overlapping, but they overlap after the
expansion.
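
To make the expansion arithmetic from the quoted example concrete, here is
a minimal sketch in C. It assumes x86-64 with 4kB pages, where one PMD
entry maps 2MB. The names PMD_SIZE_SKETCH, struct move_range,
expand_to_pmd() and ranges_overlap() are made up for illustration and are
not kernel code, and the sketch does not check the "nothing mapped right
below 'old' or 'new'" precondition at all.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: x86-64 with 4kB pages, so one PMD entry maps 2MB. */
#define PMD_SIZE_SKETCH 0x200000ul
#define PMD_OFFSET_MASK (PMD_SIZE_SKETCH - 1)

/* Hypothetical description of a move request (not a kernel structure). */
struct move_range {
	uintptr_t old_addr;
	uintptr_t new_addr;
	uintptr_t len;
};

/*
 * Grow the move downward so that both 'old' and 'new' start on a PMD
 * boundary. This only works when both addresses sit at the same offset
 * within their PMD, so return false (and change nothing) otherwise.
 */
static bool expand_to_pmd(struct move_range *r)
{
	uintptr_t old_off = r->old_addr & PMD_OFFSET_MASK;
	uintptr_t new_off = r->new_addr & PMD_OFFSET_MASK;

	if (old_off != new_off)
		return false;

	r->old_addr -= old_off;
	r->new_addr -= old_off;
	r->len += old_off;
	return true;
}

/* True if [old, old+len) and [new, new+len) share at least one byte. */
static bool ranges_overlap(const struct move_range *r)
{
	return r->new_addr < r->old_addr + r->len &&
	       r->old_addr < r->new_addr + r->len;
}
```

Feeding it old = 0x1fff000, new = 0x1dff000, len = 0x201000 gives back
old = 0x1e00000, new = 0x1c00000, len = 0x400000. With len = 0x200000
instead, the two ranges do not overlap before the expansion but do overlap
after it, which is the case described above.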

IOW, imagine that you have a 2GB mapping, but it is not 2GB-aligned
virtually, and you want to move that mapping down by 2GB.

Now, because that 2GB mapping is *not* 2GB-aligned, it actually takes
up *two* PMD entries. But if that mapping is the only thing that
exists in those two PMD entries, and the PMD entry below it is clear
(because there is no mapping right below the new address), then we can
still do that unaligned 2GB mapping move entirely at the PMD level.

So instead of wasting time to move it one page at a time (until it is
2GB aligned), we could just move two PMD entries around.
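
The "takes up two entries" counting can be sketched generically. The size
covered by one entry depends on the architecture and on the page table
level, so the sketch below takes it as a parameter rather than assuming a
value. The helper name entries_spanned() is hypothetical.

```c
#include <stdint.h>

/*
 * Count how many naturally aligned entries of size 'entry_size' the
 * range [addr, addr+len) touches. 'entry_size' must be non-zero.
 */
static uint64_t entries_spanned(uint64_t addr, uint64_t len,
				uint64_t entry_size)
{
	uint64_t first, last;

	if (len == 0)
		return 0;

	first = addr / entry_size;
	last = (addr + len - 1) / entry_size;
	return last - first + 1;
}
```

A mapping exactly as large as one entry touches one entry when it is
aligned to that size, and two entries when it is not.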

Here's a (UNTESTED! It compiles, but that's it) user test-case for
this situation:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <string.h>

/* Pick some random 2GB-aligned address that isn't near anything else */
#define GB (1ul << 30)
#define VA ((void *)(128 * GB))

#define old (VA+GB)
#define new (VA-GB)
#define len (2*GB)

int main(int argc, char **argv)
{
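	void *addr;

	/* Map 2GB at an address that is not 2GB-aligned */
	addr = mmap(old, len,
		PROT_READ | PROT_WRITE,
		MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED,
		-1, 0);
	if (addr == MAP_FAILED)
		return 1;

	/* Touch every page so that the page tables are actually populated */
	memset(addr, 0xff, len);

	/* Move the whole mapping down by 2GB */
	if (mremap(old, len, len,
		MREMAP_MAYMOVE | MREMAP_FIXED, new) == MAP_FAILED)
		return 1;
	return 0;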
}

and I claim that that mremap() right now ends up doing the whole 2GB
page table move one page at a time, but it *should* be doable as just
two PMD entry moves.

See?

Linus