Re: [PATCH 6/6] mm,oom: wait for OOM victims when using oom_kill_allocating_task == 1

From: Michal Hocko
Date: Wed Feb 17 2016 - 08:32:49 EST


On Wed 17-02-16 19:36:36, Tetsuo Handa wrote:
> >From 0b36864d4100ecbdcaa2fc2d1927c9e270f1b629 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> Date: Wed, 17 Feb 2016 16:37:59 +0900
> Subject: [PATCH 6/6] mm,oom: wait for OOM victims when using oom_kill_allocating_task == 1
>
> Currently, out_of_memory() does not wait for existing TIF_MEMDIE threads
> if /proc/sys/vm/oom_kill_allocating_task is set to 1. This can result in
> killing more OOM victims than needed. We can wait for the OOM reaper to
> reap memory used by existing TIF_MEMDIE threads if possible. If the OOM
> reaper is not available, the system will be kept OOM stalled until an
> OOM-unkillable thread does a GFP_FS allocation request and calls
> oom_kill_allocating_task == 0 path.
>
> This patch changes oom_kill_allocating_task == 1 case to call
> select_bad_process() in order to wait for existing TIF_MEMDIE threads.

The primary motivation for oom_kill_allocating_task was to reduce the
overhead of select_bad_process. See fe071d7e8aae ("oom: add
oom_kill_allocating_task sysctl"). So this basically defeats the whole
purpose of the feature.

I am not user of this knob because it behaves absolutely randomly but
IMHO we should simply do something like the following. It would be more
compliant to the documentation and prevent from livelock which is
currently possible (albeit very unlikely) when a single task consimes
all the memory reserves and we keep looping over out_of_memory without
any progress.

But as I've said I have no idea whether somebody relies on the current
behavior so this is more of a thinking loudly than proposing an actual
patch at this point of time.
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 078e07ec0906..7de84fb2dd03 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -706,6 +706,9 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
message, task_pid_nr(p), p->comm, points);

+ if (sysctl_oom_kill_allocating_task)
+ goto kill;
+
/*
* If any of p's children has a different mm and is eligible for kill,
* the one with the highest oom_badness() score is sacrificed for its
@@ -734,6 +737,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
}
read_unlock(&tasklist_lock);

+kill:
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -888,6 +892,9 @@ bool out_of_memory(struct oom_control *oc)
if (sysctl_oom_kill_allocating_task && current->mm &&
!oom_unkillable_task(current, NULL, oc->nodemask) &&
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+ if (test_thread_flag(TIF_MEMDIE))
+ panic("Out of memory (oom_kill_allocating_task) not able to make a forward progress");
+
get_task_struct(current);
oom_kill_process(oc, current, 0, totalpages, NULL,
"Out of memory (oom_kill_allocating_task)");
--
Michal Hocko
SUSE Labs