Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap

From: Michal Hocko
Date: Fri Jul 28 2017 - 02:23:53 EST


On Thu 27-07-17 16:55:59, Andrea Arcangeli wrote:
> On Thu, Jul 27, 2017 at 08:50:24AM +0200, Michal Hocko wrote:
> > Yes this will work and it won't depend on the oom_lock. But isn't it
> > just more ugly than simply doing
> >
> > if (tsk_is_oom_victim) {
> > down_write(&mm->mmap_sem);
> > locked = true;
> > }
> > free_pgtables(...)
> > [...]
> > if (locked)
> > down_up(&mm->mmap_sem);
>
> To me not doing if (tsk_is_oom...) { down_write; up_write } is by
> default a confusing implementation, because it's not strict and not
> strict code is not self documenting and you've to think twice of why
> you're doing something the way you're doing it.

I disagree. But this is a matter of taste, I guess. I prefer when
locking is explicit and clear about the scope. An empty locked region
used for synchronization is just obscure and you have to think much more
what it means when you can see what the lock actually protects. It is
also less error prone I believe.

> The doubt on what was the point to hold the mmap_sem during
> free_pgtables is precisely why I started digging into this issue
> because it didn't look possible you could truly benefit from holding
> the mmap_sem during free_pgtables.

The point is that you cannot remove page tables while somebody is
walking them. This is enforced in all other paths except for exit which
was kind of special because nobody could race there.

> I also don't like having a new invariant that your solution relies on,
> that is mm->mmap = NULL, when we can make just set the MMF_OOM_SKIP a
> bit earlier that it gets set anyway and use that to control the other
> side of the race.

Well, again a matter of taste. I prefer making it clear that mm->mmap is
no longer valid because it points to a freed pointer. Relying on
MMF_OOM_SKIP for the handshake is just abusing the flag for a different
purpose than it was intended.

> I like strict code that uses as fewer invariants as possible and that
> never holds a lock for any instruction more than it is required (again
> purely for self documenting reasons, the CPU won't notice much one
> instruction more or less).
>
> Even with your patch the two branches are unnecessary, that may not be
> measurable, but it's still wasted CPU. It's all about setting mm->mmap
> before the up_write. In fact my patch should at least put an incremental
> unlikely around my single branch added to exit_mmap.
>
> I see the {down_write;up_write} Hugh's ksm_exit-like as a strict
> solution to this issue and I wrote it specifically while trying to
> research a way to be more strict because from the start it didn't look
> the holding of the mmap_sem during free_pgtables was necessary.

While we have discussed that with Hugh he has shown an interest to
actually get rid of this (peculiar) construct which would be possible if
the down_write was implicit in exit_mmap [1].

> I'm also fine to drop the oom_lock but I think it can be done
> incrementally as it's a separate issue, my second patch should allow
> for it with no adverse side effects.
>
> All I care about is the exit_mmap path because it runs too many times
> not to pay deep attention to every bit of it ;).

Well, all I care about is to fix this over-eager oom killing due to oom
reaper setting MMF_OOM_SKIP too early. That is a regression. I think we
can agree that both proposed solutions are functionally equivalent. I do
not want to push mine too hard so if you _believe_ that yours is really
better feel free to post it for inclusion.

[1] http://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@xxxxxxxxxxxx
--
Michal Hocko
SUSE Labs