[PATCH 0/10 -v3] Handle oom bypass more gracefully

From: Michal Hocko
Date: Fri Jun 03 2016 - 05:16:58 EST

This is the third version of the patchset. I have folded in all the
fixes pointed out by Oleg (thanks) since the previous posting. I hope I
haven't missed anything.

The following 10 patches should put some order into the very rare cases
of an mm shared between processes and finally make the paths which
bypass the oom killer oom reapable, and so much more reliable. An mm
shared outside of a thread group is rare (vforked tasks for a short
period, use_mm by kernel threads, or the exotic thread model of
clone(CLONE_VM) without CLONE_THREAD resp. CLONE_SIGHAND). Still, not
only does it make the current oom killer logic quite hard to follow and
evaluate, it can also lead to weird corner cases. E.g. it is possible to
select an oom victim which shares the mm with an unkillable process, or
to bypass the oom killer even when other processes sharing the mm are
still alive.

Patch 1 drops bogus task_lock and mm check from oom_{score_}adj_write.
This can be considered a bug fix with a low impact as nobody has noticed
for years.

Patch 2 drops the sighand lock because it is not needed anymore, as
pointed out by Oleg.

Patch 3 is a clean up of oom_score_adj handling and preparatory
work for later patches.

Patch 4 enforces oom_score_adj to be consistent between processes
sharing the mm, so that they behave consistently with regular thread
groups. This can be considered a user visible behavior change because
one thread group updating oom_score_adj will affect others which share
the same mm via clone(CLONE_VM). I argue that this should be acceptable
because we already have the same behavior for threads in the same thread
group and sharing the mm without signal struct is just a different model
of threading. This is probably the most controversial part of the series,
I would like to find some consensus here though. There were some
suggestions to hook some counter/oom_score_adj into the mm_struct
but I feel that this is not necessary right now and we can rely on
proc handler + oom_kill_process to DTRT. I can be convinced otherwise
but I strongly think that whatever we do the userspace has to have
a way to see the current oom priority as consistently as possible.
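
To illustrate the behavior change in patch 4, here is a minimal
userspace sketch. It is not the kernel code: struct proc_model and
set_oom_score_adj_shared() are hypothetical names made up for this
illustration. The only thing taken from the series is the rule that an
oom_score_adj update becomes visible in every process sharing the mm.

```c
#include <stddef.h>

/* Hypothetical model of a process (thread group); not a kernel structure. */
struct proc_model {
	int pid;
	const void *mm;		/* identity of the address space, NULL if none */
	int oom_score_adj;
};

/*
 * Apply a new oom_score_adj to @target and to every other process in
 * @procs which shares its mm. @target must be an element of @procs.
 * Returns the number of processes updated.
 */
static int set_oom_score_adj_shared(struct proc_model *procs, size_t n,
				    struct proc_model *target, int value)
{
	const void *mm = target->mm;
	int updated = 0;
	size_t i;

	/* No mm to share: only the target itself is affected. */
	if (!mm) {
		target->oom_score_adj = value;
		return 1;
	}

	for (i = 0; i < n; i++) {
		if (procs[i].mm != mm)
			continue;
		procs[i].oom_score_adj = value;
		updated++;
	}
	return updated;
}
```

With three processes where the first two share an mm, an update through
the first one changes both of them and leaves the third untouched.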

Patch 5 makes sure that no vforked task is selected if it is sharing
the mm with an oom unkillable task.

Patch 6 ensures that all user tasks sharing the mm are killed which in
turn makes sure that all oom victims are oom reapable.
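
The rule from patch 6 can be sketched in userspace as follows. This is
a simplified model under stated assumptions, not the code from
mm/oom_kill.c: struct victim_model and kill_all_sharing_mm() are
hypothetical names, and the model only reports whether an unkillable
sharer (kthread or global init) is left behind. What the kernel does in
that case is not modeled here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a process; not a kernel structure. */
struct victim_model {
	int pid;
	const void *mm;		/* identity of the address space */
	bool is_kthread;	/* kernel thread borrowing the mm (use_mm) */
	bool is_global_init;
	bool killed;
};

/*
 * Mark @victim and every other user process sharing its mm as killed.
 * Kernel threads and global init are skipped because they cannot be
 * killed. @victim must be an element of @procs.
 *
 * Returns true if no unkillable sharer of the mm was found, i.e. the
 * mm counts as reapable in this model.
 */
static bool kill_all_sharing_mm(struct victim_model *procs, size_t n,
				const struct victim_model *victim)
{
	const void *mm = victim->mm;
	bool reapable = true;
	size_t i;

	for (i = 0; i < n; i++) {
		if (procs[i].mm != mm)
			continue;
		if (procs[i].is_kthread || procs[i].is_global_init) {
			reapable = false;
			continue;
		}
		procs[i].killed = true;
	}
	return reapable;
}
```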

Patch 7 guarantees that task_will_free_mem will always imply reapable
bypass of the oom killer.

Patch 8 is new in this version and it addresses an issue pointed out
by a 0-day OOM report where an oom victim was reaped several times.

Assuming there are no other bugs in those patches and no fundamental
opposition to this direction I think we should go on and merge them
into the mmotm tree and target the 4.8 merge window.

Finally the last 2 patches are sent as an RFC because I am still not sure
this direction is the correct one. Patch 9 puts an upper bound on how many
times oom_reaper tries to reap a task and hides it from the oom killer to
move on when no progress can be made. Patch 10 tries to plug the (hopefully)
last hole where we can still lock up when the oom victim's mm is shared with
oom unkillable tasks (kthreads and global init). We just try to be best
effort in that case and rather fall back to killing something else than
risk a lockup.
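
The idea behind patch 9 can be sketched in userspace like this. It is
an illustration only, not the oom_reaper code: struct reap_model,
reap_attempt() and the MAX_REAP_ATTEMPTS value are assumptions made for
this example. The bound of two is borrowed from the patch title.

```c
#include <stdbool.h>

#define MAX_REAP_ATTEMPTS 2	/* assumed bound, for illustration only */

/* Hypothetical per-victim bookkeeping; not a kernel structure. */
struct reap_model {
	int attempts;		/* how many times reaping was tried */
	bool reaped;		/* a reap attempt succeeded */
	bool hidden_from_oom;	/* the oom killer should move on */
};

/*
 * Record one reap attempt. @success says whether this attempt managed
 * to reap the victim. Returns true if another attempt is still allowed.
 */
static bool reap_attempt(struct reap_model *t, bool success)
{
	/* Nothing more to do for reaped or already hidden victims. */
	if (t->reaped || t->hidden_from_oom)
		return false;

	t->attempts++;
	if (success) {
		t->reaped = true;
		return false;
	}

	/* No progress within the bound: hide it so the oom killer moves on. */
	if (t->attempts >= MAX_REAP_ATTEMPTS) {
		t->hidden_from_oom = true;
		return false;
	}
	return true;
}
```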

The patchset is based on the current mmotm tree (mmotm-2016-05-27-15-19).
I would really appreciate a deep review as this area is full of land
mines but I hope I've made the code much cleaner with fewer kludges.

I have pushed the patchset to my git tree
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git to branch

Michal Hocko (10):
  proc, oom: drop bogus task_lock and mm check
  proc, oom: drop bogus sighand lock
  proc, oom_adj: extract oom_score_adj setting into a helper
  mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  mm, oom: skip vforked tasks from being selected
  mm, oom: kill all tasks sharing the mm
  mm, oom: fortify task_will_free_mem
  mm, oom: task_will_free_mem should skip oom_reaped tasks
  mm, oom_reaper: do not attempt to reap a task more than twice
  mm, oom: hide mm which is shared with kthread or global init

 fs/proc/base.c        | 185 ++++++++++++++++++++++---------------------
 include/linux/mm.h    |   2 +
 include/linux/oom.h   |  26 +-----
 include/linux/sched.h |  27 +++++++
 mm/memcontrol.c       |   4 +-
 mm/oom_kill.c         | 214 ++++++++++++++++++++++++++++++++++----------------
 6 files changed, 278 insertions(+), 180 deletions(-)