Re: [RFC PATCH] mm, oom: introduce oom reaper

From: Michal Hocko
Date: Thu Nov 26 2015 - 06:08:59 EST


On Wed 25-11-15 15:08:06, Johannes Weiner wrote:
> Hi Michal,
>
> I think whatever we end up doing to smoothen things for the "common
> case" (as much as OOM kills can be considered common), we need a plan
> to resolve the memory deadlock situations in a finite amount of time.
>
> Eventually we have to attempt killing another task. Or kill all of
> them to save the kernel.
>
> It just strikes me as odd to start with smoothening the common case,
> rather than making it functionally correct first.

I believe there is not an universally correct solution for this
problem. OOM killer is a heuristic and a destructive one so I think we
should limit it as much as possible. I do agree that we should allow an
administrator to define a policy when things go terribly wrong - e.g.
panic/emerg. reboot after the system is trashing on OOM for more than
a defined amount of time. But I think that this is orthogonal to this
patch. This patch should remove one large class of potential deadlocks
and corner cases without too much cost or maintenance burden. It doesn't
remove a need for the last resort solution though.

> On Wed, Nov 25, 2015 at 04:56:58PM +0100, Michal Hocko wrote:
> > A kernel thread has been chosen because we need a reliable way of
> > invocation so workqueue context is not appropriate because all the
> > workers might be busy (e.g. allocating memory). Kswapd which sounds
> > like another good fit is not appropriate as well because it might get
> > blocked on locks during reclaim as well.
>
> Why not do it directly from the allocating context? I.e. when entering
> the OOM killer and finding a lingering TIF_MEMDIE from a previous kill
> just reap its memory directly then and there. It's not like the
> allocating task has anything else to do in the meantime...

One reason is that we have to exclude race with exit_mmap so we have to
increase mm_users but we cannot mmput in this context because we might
deadlock. So we have to tear down from a different context. Another
reason is that address space of the victim might be really large and
reaping from on behalf of one (random) task might be really unfair
wrt. others. Doing that from a kernel threads sounds like an easy and
relatively cheap way to workaround both issues.

>
> > @@ -1123,7 +1126,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > continue;
> > }
> > /* If details->check_mapping, we leave swap entries. */
> > - if (unlikely(details))
> > + if (unlikely(details || !details->check_swap_entries))
> > continue;
>
> &&

Ups, thanks for catching this! I was playing with the condition and
rearranged the code multiple times before posting.

Thanks!
---
diff --git a/mm/memory.c b/mm/memory.c
index 4750d7e942a3..49cafa195527 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1125,8 +1125,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
}
continue;
}
- /* If details->check_mapping, we leave swap entries. */
- if (unlikely(details || !details->check_swap_entries))
+ /* only check swap_entries if explicitly asked for in details */
+ if (unlikely(details && !details->check_swap_entries))
continue;

entry = pte_to_swp_entry(ptent);
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/