Re: WARNING: at mm/mremap.c:211 move_page_tables in i386

From: Linus Torvalds
Date: Sat Jul 11 2020 - 14:20:57 EST


On Sat, Jul 11, 2020 at 10:27 AM Naresh Kamboju
<naresh.kamboju@xxxxxxxxxx> wrote:
>
> I have started bisecting this problem and found the first bad commit

Thanks for the effort. Bisection is often a great tool to figure out
what's wrong.

Sadly, in this case:

> commit 9f132f7e145506efc0744426cb338b18a54afc3b
> Author: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> Date: Thu Jan 3 15:28:41 2019 -0800
>
> mm: select HAVE_MOVE_PMD on x86 for faster mremap

Yeah, that's just the commit that enables the code, not the commit
that introduces the fundamental problem.

That said, this is a prime example of why I absolutely detest patch
series that do this kind of thing, and are several patches that create
new functionality, followed by one patch to enable it.

If you can't get things working incrementally, maybe you shouldn't do
them at all. Doing a big series of "hidden work" and then enabling it
later is wrong.

In this case, though, the real patch that did the code isn't that kind
of "big series of hidden work" patch series, it's just the (single)
previous commit 2c91bd4a4e2e ("mm: speed up mremap by 20x on large
regions").

So your bisection is useful, it's just that it really points to that
previous commit, and it's where this code was introduced.

It's also worth noting that that commit doesn't really *break*
anything, since it just falls back to the old behavior when it warns.

So to "fix" your test-case, we could just remove the WARN_ON.

But the WARN_ON() is still worrisome, because the thing it tests for
really _should_ be true.

Now, we actually have a known bug in this area that is talked about
elsewhere: the way unmap does the pgtable_free() is

/* Detach vmas from rbtree */
detach_vmas_to_be_unmapped(mm, vma, prev, end);

if (downgrade)
mmap_write_downgrade(mm);

unmap_region(mm, vma, prev, start, end);

(and unmap_region() is what does the pgtable_free() that should have
cleared the PMD).

And the problem with the "downgrade" is that another thread can change
the beginning of the next vma when it's a grow-down region (or the end
of the prev one if it's a grow-up).

See commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
munmap") for the source of that

But that requires an _actual_ "unmap()" system call (the others set
"downgrade" to false - I'm not entirely sure why), and it requires
another thread to be attached to that VM in order to actually do that
racy concurrent stack size change.

And neither of those two cases will be true for the execve() path.
It's a new VM, with just one thread attached, so no threaded munmap()
going on there.

The fact that it seems to happen with

https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/thp/thp01.c

makes me think it's somehow related to THP mappings, but I don't see
why those would matter. All the same pmd freeing should still have
happened, afaik.

And the printout I asked for a few days back for when it triggered
clearly showed a normal non-huge pmd ("val: 7d530067" is just
"accessed, dirty, rw, user and present", which is a perfectly normal
page directory entry for 4kB pages, and we could move the whole thing
and move 2MB (or 4MB) of aligned virtual memory in one go).

Some race with THP splitting and pgtable_free()? I can't see how
anything would race in execve(), or how anything would have touched
that address below the stack in the first place..

Kirill, Oleg, and reaction from this? Neither of you were on the
original email, I think, it's this one:

https://lore.kernel.org/lkml/CA+G9fYt+6OeibZMD0fP=O3nqFbcN3O4xcLkjq0mpQbZJ2uxB9w@xxxxxxxxxxxxxx/

and I really think it is harmless in that when the warning triggers,
we just go back to the page-by-page code, but while I think the
WARN_ON() should probably be downgraded to a WARN_ON_ONCE(), I do
think it's showing _something_.

I just can't see how this would trigger for execve(). That's just
about the _simplest_ case for us: single-threaded, mappings set up
purely by load_elf_binary() etc.

I'm clearly missing something.

Linus