Re: [PATCH 2/5] oom reaper: handle mlocked pages

From: Hugh Dickins
Date: Sun Feb 28 2016 - 22:19:43 EST


On Tue, 23 Feb 2016, Michal Hocko wrote:
> On Mon 22-02-16 17:36:07, David Rientjes wrote:
> >
> > Are we concerned about munlock_vma_pages_all() taking lock_page() and
> > perhaps stalling forever, the same way it would stall in exit_mmap() for
> > VM_LOCKED vmas, if another thread has locked the same page and is doing an
> > allocation?
>
> This is a good question. I have checked for that particular case
> previously and managed to convince myself that this is OK(ish).
> munlock_vma_pages_range locks only THP pages, to prevent parallel
> split-up, AFAICS.

I think you're mistaken on that: there is also the lock_page()
on every page in Phase 2 of __munlock_pagevec().
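
For reference, Phase 2 has roughly this shape (quoting from memory of
mm/mlock.c, details elided); the lock_page() below is where munlock
sleeps whenever another thread holds the page lock:

	/* Phase 2: page munlock */
	for (i = 0; i < nr; i++) {
		struct page *page = pvec->pages[i];

		if (!page)
			continue;
		lock_page(page);	/* sleeps if contended */
		/* ... LRU putback and Mlocked stats fixup under page lock ... */
		unlock_page(page);
		put_page(page);
	}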

> And split_huge_page_to_list doesn't seem
> to depend on an allocation. It can block on anon_vma lock but I didn't
> see any allocation requests from there either. I might be missing
> something of course. Do you have any specific path in mind?
>
> > I'm wondering if in that case it would be better to do a
> > best-effort munlock_vma_pages_all() with trylock_page() and just give up
> > on releasing memory from that particular vma. In that case, there may be
> > other memory that can be freed with unmap_page_range() that would handle
> > this livelock.

I agree with David that we ought to trylock_page() throughout munlock:
so long as it gets to do the TestClearPageMlocked without demanding the
page lock, the rest is just the usual sugarcoating for accurate Mlocked
stats, which we can leave for reclaim to fix up.
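
Something like this minimal sketch is all I mean - not a patch, and
__munlock_fixup_page() below is only a placeholder for whatever stats
and LRU fixup we keep under the page lock:

	static void munlock_page_best_effort(struct page *page)
	{
		if (!TestClearPageMlocked(page))
			return;		/* already munlocked by someone else */

		if (!trylock_page(page))
			return;		/* contended: leave it to reclaim */

		/* placeholder: accurate Mlocked stats and LRU putback */
		__munlock_fixup_page(page);
		unlock_page(page);
	}

The important part is that Mlocked is cleared whether or not the trylock
succeeds, so reclaim can always finish the job later.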

>
> I have tried to code it up but I am not really sure the whole churn is
> worth it - unless I am missing something that would make the THP case
> likely to hit in real life.

Though I must have known about it forever, it was a shock to see all
those page locks demanded in exit when it was brought home to us a week
or so ago.

The proximate cause in this case was my own change, to defer pte_alloc
to suit huge tmpfs: it had not previously occurred to me that I was
now doing the pte_alloc while __do_fault holds the page lock. Bad Hugh.
But that change is not yet upstream, so it's not so urgent for you.

From time immemorial, free_swap_and_cache() and free_swap_cache() only
ever trylock a page, precisely so that they never hold up munmap or exit
(well, if I looked harder, I might find lock ordering reasons too).
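
The pattern there is roughly this (simplified, not the exact
mm/swapfile.c code):

	page = find_get_page(swap_address_space(entry), entry.val);
	if (page && !trylock_page(page)) {
		put_page(page);		/* contended: leave it to reclaim */
		page = NULL;
	}
	if (page) {
		/* ... drop the swap cache entry under page lock ... */
		unlock_page(page);
		put_page(page);
	}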

>
> Just for the reference this is what I came up with (just compile tested).

I tried something similar internally (on an earlier kernel). Like
you, I've set that work aside for now; there were quicker ways to fix
the issue at hand. But it does continue to offend me that munlock
demands all those page locks: so if you don't get back to it before me,
I shall eventually.

I didn't understand why you complicated yours with the "enforce"
arg to munlock_vma_pages_range(): why not just trylock in all cases?

Hugh