Re: WARN_ON in move_normal_pmd

From: Joel Fernandes
Date: Sat Mar 25 2023 - 12:47:21 EST


On Sat, Mar 25, 2023 at 12:33 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi Linus,
>
> On Fri, Mar 24, 2023 at 04:38:03PM -0700, Linus Torvalds wrote:
> > On Fri, Mar 24, 2023 at 6:43 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> > >
> > > Wouldn't it be better to instead fix it from the caller side? Like
> > > making it non-overlapping.
> >
> > I wonder if we could just do something like this in mremap() instead
> >
> > - if old/new are mutually PMD_ALIGNED
> >
> > - *and* there is no vma below new within the same PMD
> >
> > - then just expand the mremap to be PMD-aligned downwards
> >
> > IOW, the problem with the exec stack moving case isn't really that
> > it's overlapping: that part is fine. We're moving downwards, and we
> > start from the bottom, so the moving part works fine.
> >
> > No, the problem is that we *start* by moving individual pages, and
> > then by the time we've a few pages down by a whole PMD, we finish the
> > source PMD (and we've cleared all the contents of it), but it still
> > exists.
> >
> > And at *that* point, when we go and start copying the next page, we're
> > suddenly fully PMD-aligned, and now we try to copy a whole PMD, and
> > then that code is unhappy about the fact that the old (empty) PMD is
> > there in the target.
> >
>
> You are very right. I am able to also trigger the warning with a synthetic
> program that just mremaps a range and moves it in the same way that triggers
> this issue, however I had to hack the kernel to prevent mremap from erroring
> out if ranges overlap (unlike exec, mremap does some initial checks for
> that). Also had to do other hacks but I did reproduce it consistently.
>
> The issue is that even though the PMD is empty, it is allocated. So
> pmd_none() is kind of a lie in some sense, it is pointing to empty PTEs when
> the range is really empty.
>
> How about we replace the warning with something like the following?
>
> + if (unlikely(!pmd_none(*new_pmd))) {
> + // Check if any ptes in the pmd are non-empty. Doing this here
> + // is ok since this is not a fast path.
> + bool pmd_empty = true;
> + unsigned long tmp_addr = new_addr;
> + pte_t* check_pte = pte_offset_map(new_pmd, new_addr);
> +
> + new_ptl = pte_lockptr(mm, new_pmd);
> + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> + for (; tmp_addr < old_addr + PMD_SIZE; check_pte++, tmp_addr += PAGE_SIZE) {

Apologies, here I was going for "tmp_addr < new_addr + PMD_SIZE". I
made the change and it still works (This is just to show the basic
idea, I am still testing it).

thanks,

- Joel


> + if (!pte_none(*check_pte)) {
> + pmd_empty = false;
> + break;
> + }
> + }
>
> + WARN_ON_ONCE(!pmd_empty);
> + spin_unlock(new_ptl);
> + }
> +
>
> > And for all of this to happen, we need to move things by an exact
> > multiple of PMD size, because otherwise we'd never get to that aligned
> > situation at all, and we'd always do all the movement in individual
> > pages, and everything would be just fine.
> >
> > And more importantly, if we had just *started* with moving a whole
> > PMD, this also wouldn't have happened. But we didn't. We started
> > moving individual pages.
> >
> > So you could see the warning not as a "this range overlaps" warning
> > (it's fine, and happens all the time, and we do individual pages that
> > way quite happily), but really as a "hey, this was very inefficient -
> > you shouldn't have done those individual pages as several small
> > independent invidual pages in the first place" warning.
> >
>
> Exactly.
>
> > So some kind of
> >
> > /* Is the movement mutually PMD-aligned? */
> > if ((old_addr ^ new_addr) & ~PMD_MASK == 0) {
> > .. try to extend the move_vma() down to the *aligned*
> > PMD case ..
> > }
> >
>
> I actually didn't follow what you meant by "mutually PMD-aligned". Could you
> provide some example address numbers to explain?
>
> AFAIK, only 2MB aligned memory addresses can be directly addressed by a PMD.
> Otherwise how will you index the bytes in the 2MB page? You need bits in the
> address for that.
>
> > logic in move_page_tables() would get rid of the warning, and would
> > make the move more efficient since you'd skip the "move individual
> > pages and allocate a new PMD" case entirely.
> >
> > This is all fairly com,plicated, and the "try to extend the move
> > range" would also have to depend on CONFIG_HAVE_MOVE_PMD etc, so I'm
> > not saying it's trivial.
> >
> > But it would seem to be a really nice optimization, in addition to
> > getting rid of the warning.
> >
> > It could even help real world cases outside of this odd stack
> > remapping case if users ever end up moving vma's by multiples of
> > PMD_SIZE, and there aren't other vma's around the source/target that
> > disable the optimization.
> >
> > Hmm? Anybody want to look into that? It looks hairy enough that I
> > think that "you could test this with mutually aligned mremap()
> > source/targets in some test program" would be a good thing. Because
> > the pure execve() case is rare enough that using *that* as a test-case
> > seems like a fool's errand.
> >
>
> Just to mention, mremap errors out if you try to move overlapping ranges
> because this in mremap_to():
>
> /* Ensure the old/new locations do not overlap
> if (addr + old_len > new_addr && new_addr + new_len > addr) {
> pr_err("%s: (%s) (%d)", __func__, __FILE__, __LINE__);
> goto out;
> }
>
> Or is there an mremap trick I might be missing which actually allows
> overlapping range moves?
>
> thanks,
>
> - Joel
>
>