Re: [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeingthe wrong VMA information

From: Mel Gorman
Date: Wed Apr 28 2010 - 11:56:21 EST


On Wed, Apr 28, 2010 at 05:35:25PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 28, 2010 at 10:15:55AM +0100, Mel Gorman wrote:
> > It became unconditional because I wasn't sure of the optimisation versus the
> > new anon_vma changes (doesn't matter, should have been safe). At the time
>
> Changeset 287d97ac032136724143cde8d5964b414d562ee3 is meant to explain
> the removal of the lock but I don't get it from the comments. Or at
> least I don't get from that comment why we can't resurrect the plain
> old deleted code that looked fine to me.

Frankly, I don't understand why it was safe to drop the lock either.
Maybe it was a mistake but I still haven't convinced myself I fully
understand the subtleties of the anon_vma changes.

> Like there is no reason to
> take the lock if start == vma->vm_start.
>
> > So, the VMA list does not appear to be messed up but there still needs
> > to be protection against modification of VMA details that are already on
> > the list. For that, the seq counter would have been enough and
> > lighter-weight than acquiring the anon_vma->lock every time in
> > vma_adjust().
> >
> > I'll drop this patch again as the execve race looks the most important.
>
> You mean you're dropping patch 2 too?

Temporarily at least until I figured out if execve was the only problem. The
locking in vma_adjust didn't look like the prime reason for the crash
but the lack of locking there is still very odd.

> I agree dropping patch 1 but
> to me the having to take all the anon_vma locks for every
> vma->anon_vma->lock that we walk seems a must, otherwise
> expand_downwards and vma_adjust won't be ok, plus we need to re-add
> the anon_vma lock to vma_adjust, it can't be safe to alter vm_pgoff
> and vm_start outside of the anon_vma->lock. Or I am mistaken?
>

No, you're not. If nothing else, vma_address can return the wrong value because
the VMAs vm_start and vm_pgoff were in the process of being updated but not
fully updated. It's hard to see how vma_address would return the wrong value
and miss a migration PTE as a result but it's a possibility. It's probably
a lot more important for transparent hugepage support.

> Patch 2 wouldn't help the swapops crash we reproduced because at that
> point the anon_vma of the stack is the local one, it's just after
> execve.
>
> vma_adjust and expand_downards would alter vm_pgoff and vm_start while
> taking only the vma->anon_vma->lock where the vma->anon_vma is the
> _local_ one of the vma.

True, although in the case of expand_downwards, it's highly unlikely that
there is also a migration PTE to be cleaned up. It's hard to see how a
migration PTE would be left behind in that case but it still looks wrong to
be updating the VMA fields without locking.

> But a vma in mainline can be indexed in
> infinite anon_vmas, so to prevent breaking migration
> vma_adjust/expand_downards the rmap_walk would need to take _all_
> anon_vma->locks for every anon_vma that the vma is indexed into.

I felt this would be too heavy in the common case which is why I made
rmap_walk() do the try-lock-or-back-off instead because rmap_walk is typically
in far less critical paths.

> Or
> alternatively like you implemented rmap_walk would need to check if
> the vma we found in the rmap_walk is different from the original
> anon_vma and to take the vma->anon_vma->lock (so taking the
> anon_vma->lock of the local anon_vma of every vma) before it can
> actually read the vma->vm_pgoff/vm_start inside vma_address.
>

To be absolutly sure, yes this is required. I don't think we've been hitting
this exact problem in these tests but it still is a better plan than adjusting
VMA details without locks.

> If the above is right it also means the new anon-vma changes also break
> the whole locking of transparent hugepage, see wait_split_huge_page,
> it does a spin_unlock_wait(&anon_vma->lock) thinking that waiting the
> "local" anon-vma is enough, when in fact the hugepage may be shared
> and belonging to the parent parent_vma->anon_vma and not to the local
> one of the last child that is waiting on the wrong lock. So I may have
> to rewrite this part of the thp locking to solve this. And for me it's
> not enough to just taking more locks inside the rmap walks inside
> split_huge_page as I used the anon_vma lock outside too.
>

No fun. That potentially could be a lot of locks to take to split the
page.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/