Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather thancoutner

From: KAMEZAWA Hiroyuki
Date: Thu Jul 14 2011 - 07:57:39 EST

Next message: Phil Carmody: "[PATCH 1/1] trivial: Fix potentially dangerous trailing ';' in #defined values/expressions"
Previous message: Dave Chinner: "Re: [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages inkswapd except in high priority"
In reply to: Michal Hocko: "Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather thancoutner"
Next in thread: Michal Hocko: "Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather thancoutner"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, 14 Jul 2011 13:30:09 +0200
Michal Hocko <mhocko@xxxxxxx> wrote:

> What about this? Just compile tested:
> ---
> From 90ab974eb69c61c2e3b94beabe9b6745fa319936 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@xxxxxxx>
> Date: Wed, 13 Jul 2011 13:05:49 +0200
> Subject: [PATCH] memcg: make oom_lock 0 and 1 based rather than coutner
>
> 867578cb "memcg: fix oom kill behavior" introduced oom_lock counter
> which is incremented by mem_cgroup_oom_lock when we are about to handle
> memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if
> oom_lock > 1 to prevent from multiple oom kills at the same time.
> The counter is then decremented by mem_cgroup_oom_unlock called from the
> same function.
>
> This works correctly but it can lead to serious starvations when we
> have many processes triggering OOM.
>
> Consider a process (call it A) which gets the oom_lock (the first one
> that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex). All
> other processes are blocked on the mutex.
> While A releases the mutex and calls mem_cgroup_out_of_memory others
> will wake up (one after another) and increase the counter and fall into
> sleep (memcg_oom_waitq). Once A finishes mem_cgroup_out_of_memory it
> takes the mutex again and decreases oom_lock and wakes other tasks (if
> releasing memory of the killed task hasn't done it yet).
> The main problem here is that everybody still race for the mutex and
> there is no guarantee that we will get counter back to 0 for those
> that got back to mem_cgroup_handle_oom. In the end the whole convoy
> in/decreases the counter but we do not get to 1 that would enable
> killing so nothing useful is going on.
> The time is basically unbounded because it highly depends on scheduling
> and ordering on mutex.
>
> This patch replaces the counter by a simple {un}lock semantic. We are
> using only 0 and 1 to distinguish those two states.
> As mem_cgroup_oom_{un}lock works on the a subtree of a hierarchy we have
> to make sure that nobody else races with us which is guaranteed by the
> memcg_oom_mutex. All other consumers just read the value atomically for
> a single group which is sufficient because we set the value atomically.
> mem_cgroup_oom_lock has to be really careful because we might be in
> higher in a hierarchy than already oom locked subtree of the same
> hierarchy:
> A
> / \
> B \
> /\ \
> C D E
>
> B - C - D tree might be already locked. While we want to enable locking E
> subtree because OOM situations cannot influence each other we definitely
> do not want to allow locking A.
> Therefore we have to refuse lock if any subtree is already locked and
> clear up the lock for all nodes that have been set up to the failure
> point.
> Unlock path is then very easy because we always unlock only that subtree
> we have locked previously.
>
> Signed-off-by: Michal Hocko <mhocko@xxxxxxx>
> ---
> mm/memcontrol.c | 48 +++++++++++++++++++++++++++++++++++++++---------
> 1 files changed, 39 insertions(+), 9 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e013b8e..29f00d0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1803,22 +1803,51 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> /*
> * Check OOM-Killer is already running under our hierarchy.
> * If someone is running, return false.
> + * Has to be called with memcg_oom_mutex
> */
> static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> {
> - int x, lock_count = 0;
> - struct mem_cgroup *iter;
> + int x, lock_count = -1;
> + struct mem_cgroup *iter, *failed = NULL;
> + bool cond = true;
>
> - for_each_mem_cgroup_tree(iter, mem) {
> - x = atomic_inc_return(&iter->oom_lock);
> - lock_count = max(x, lock_count);
> + for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> + x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> + if (lock_count == -1)
> + lock_count = x;
> + else if (lock_count != x) {
> + /*
> + * this subtree of our hierarchy is already locked
> + * so we cannot give a lock.
> + */
> + lock_count = 0;
> + failed = iter;
> + cond = false;
> + }
> }

Hm ? assuming B-C-D is locked and a new thread tries a lock on A-B-C-D-E.
And for_each_mem_cgroup_tree will find groups in order of A->B->C->D->E.
Before lock
A 0
B 1
C 1
D 1
E 0

After lock
A 1
B 1
C 1
D 1
E 0

here, failed = B, cond = false. Undo routine will unlock A.
Hmm, seems to work in this case.

But....A's oom_lock==0 and memcg_oom_wakeup() at el will not able to
know "A" is in OOM. wakeup processes in A which is waiting for oom recover..

Will this work ?
==
# cgcreate -g memory:A
# cgset -r memory.use_hierarchy=1 A
# cgset -r memory.oom_control=1 A
# cgset -r memory.limit_in_bytes= 100M
# cgset -r memory.memsw.limit_in_bytes= 100M
# cgcreate -g memory:A/B
# cgset -r memory.oom_control=1 A/B
# cgset -r memory.limit_in_bytes=20M
# cgset -r memory.memsw.limit_in_bytes=20M

Assume malloc XXX is a program allocating XXX Megabytes of memory.

# cgexec -g memory:A/B malloc 30 & #->this will be blocked by OOM of group B
# cgexec -g memory:A malloc 80 & #->this will be blocked by OOM of group A

Here, 2 procs are blocked by OOM. Here, relax A's limitation and clear OOM.

# cgset -r memory.memsw.limit_in_bytes=300M A
# cgset -r memory.limit_in_bytes=300M A

malloc 80 will end.

Thanks,
-Kame

>
> - if (lock_count == 1)
> - return true;
> - return false;
> + if (!failed)
> + goto done;
> +
> + /*
> + * OK, we failed to lock the whole subtree so we have to clean up
> + * what we set up to the failing subtree
> + */
> + cond = true;
> + for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> + if (iter == failed) {
> + cond = false;
> + continue;
> + }
> + atomic_set(&iter->oom_lock, 0)
> + }

> +done:
> + return lock_count;
> }
>
> +/*
> + * Has to be called with memcg_oom_mutex
> + */
> static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
> {
> struct mem_cgroup *iter;
> @@ -1916,7 +1945,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
> finish_wait(&memcg_oom_waitq, &owait.wait);
> }
> mutex_lock(&memcg_oom_mutex);
> - mem_cgroup_oom_unlock(mem);
> + if (locked)
> + mem_cgroup_oom_unlock(mem);
> memcg_wakeup_oom(mem);
> mutex_unlock(&memcg_oom_mutex);
>
> --
> 1.7.5.4
>
>
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Phil Carmody: "[PATCH 1/1] trivial: Fix potentially dangerous trailing ';' in #defined values/expressions"
Previous message: Dave Chinner: "Re: [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages inkswapd except in high priority"
In reply to: Michal Hocko: "Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather thancoutner"
Next in thread: Michal Hocko: "Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather thancoutner"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]