[PATCH] memcg: Do not hang on OOM when killed by userspace OOM

From: Michal Hocko
Date: Wed Jul 31 2013 - 02:48:54 EST


Eric has reported that he regularly sees task(s) stuck in the memcg OOM
handler. The only way out is to
	echo 0 > $GROUP/memory.oom_control

His usecase is:
- Set up a hierarchy with the memory and freezer controllers
  (disable kernel oom and have a process watch for oom).
- In that memory cgroup add a process with one thread per cpu.
- In one thread slowly allocate memory once per second (I think it is
  16M of RAM), then mlock and dirty it (just to force the pages into
  RAM and keep them there).
- When oom is reached, loop:
  * attempt to freeze all of the tasks.
  * if frozen, send every task SIGKILL, unfreeze, remove the directory
    in cgroupfs.
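
For anyone trying to reproduce this, the allocation step of the usecase
above might look roughly like the sketch below. It is only a sketch: the
helper name grab_and_dirty is made up here, it is not Eric's actual test
program, and mlock may fail under RLIMIT_MEMLOCK, which the helper
reports rather than treats as fatal.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not from the original report): map len bytes of
 * anonymous memory, try to pin them with mlock, and dirty every page so
 * that the memory is really charged to the cgroup.
 *
 * Returns NULL if the mapping fails. *locked is set to 1 if mlock
 * succeeded and to 0 otherwise (e.g. RLIMIT_MEMLOCK is too low).
 */
static void *grab_and_dirty(size_t len, int *locked)
{
	void *p;

	assert(len > 0 && locked != NULL);

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	*locked = (mlock(p, len) == 0);

	/* Touch every byte so the pages are faulted in and dirty. */
	memset(p, 0xa5, len);
	return p;
}
```

A reproducer thread would call this in a loop with len set to 16M and a
one second sleep between calls, never unmapping the result, until the
group hits its limit.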

Eric has then pinpointed the issue to be memcg specific.

All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
Those that have received a fatal signal will bypass the charge and should
continue on their way out. The tricky part is that the exit path might
trigger a page fault (e.g. in exit_robust_list) and thus another memcg
charge while the memcg is still under OOM, because nobody has released
any charges. Unlike with the in-kernel OOM handler, the exiting task
doesn't get TIF_MEMDIE set, so it doesn't shortcut charges and falls into
the memcg OOM again without any way out of it, as there are no fatal
signals pending anymore.

This patch sets the TIF_MEMDIE flag proactively in mem_cgroup_handle_oom
if the memcg OOM killer is disabled (memcg->oom_kill_disable) and the
task is woken up with a fatal signal pending. This means that any further
charges will be bypassed early in __mem_cgroup_try_charge and the task
will finally have a chance to exit.

Strictly speaking we might also mark a task which hasn't been killed by
the userspace OOM handler, but this is not harmful as the task is going
away anyway and an under-OOM group would like to see it go as soon as
possible.

Reported-by: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
Debugged-by: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
Signed-off-by: Michal Hocko <mhocko@xxxxxxx>
---
mm/memcontrol.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d12ca6f..d4103b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2235,8 +2235,19 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,

 	mem_cgroup_unmark_under_oom(memcg);
 
-	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+	if (test_thread_flag(TIF_MEMDIE))
 		return false;
+
+	/*
+	 * Userspace OOM killer might have killed this task but
+	 * there is no way it could have set TIF_MEMDIE as well
+	 * so we have to set it manually.
+	 */
+	if (fatal_signal_pending(current)) {
+		if (memcg->oom_kill_disable)
+			set_thread_flag(TIF_MEMDIE);
+		return false;
+	}
 	/* Give chance to dying process */
 	schedule_timeout_uninterruptible(1);
 	return true;
--
1.8.3.2

--
Michal Hocko
SUSE Labs