Re: [RFC][PATCH 1/2] memcg: oom kill handling improvement

From: Daisuke Nishimura
Date: Thu Feb 25 2010 - 23:21:12 EST


On Wed, 24 Feb 2010 16:59:21 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> These are dump of patches just for showing concept, what I want to do.
> But not tested. please see if you have free time. (you can ignore ;)
>
> Anyway, this will HUNK to the latest mmotm, Kirill's work is merged.
>
> This is not related to David's work. I don't hesitate to rebase mine
> to the mmotm if his one is merged, it's easy.
> But I'm not sure his one goes to mm soon.
>
> 1st patch is for better handling oom-kill under memcg.
It's bigger than I expected, but it basically looks good to me.

> 2nd patch is oom-notifier and oom-kill-disable for memcg knob.
>
This feature is very atractive.


One comment to this patch for now.

> +/*
> + * Check there are ongoing oom-kill in this hierarchy or not.
> + * If now under oom-kill, wait for some event to restart job.
> + */
> +static bool memcg_handle_oom(struct mem_cgroup *mem, gfp_t mask)
> +{
> + int oom_count = 0;
> + DEFINE_WAIT(wait);
> + /*
> + * Considering hierarchy (below)
> + * /A
> + * /01
> + * /02
> + * If 01 or 02 is under oom-kill, oom-kill in A should wait.
> + * If "A" is under oom-kill, oom-kill in 01 and 02 should wait.
> + * (task in 01/02 can be killed.)
> + * But if 01 is under oom-kill, 02's oom-kill is independent from it.
> + */
> + prepare_to_wait(&memcg_oom_waitq, &wait, TASK_INTERRUPTIBLE);
> + mem_cgroup_walk_tree(mem, &oom_count, set_memcg_oom_cb);
> + /* Am I the 1st oom killer in this sub hierarchy ? */
> + if (oom_count == 1) {
> + finish_wait(&memcg_oom_waitq, &wait);
> + mem_cgroup_out_of_memory(mem, mask);
> + mem_cgroup_walk_tree(mem, NULL, unset_memcg_oom_cb);
I think we need call memcg_oom_wake() here. Some contexts might have slept already,
but other callers of memcg_oom_wake() calle it after checking memcg_under_oom(),
so if we don't wake them up here, they continue to sleep, IIUC.


Thanks,
Daisuke Nishimura.

> I'm sorry that I'll be absent tomorrow.
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>
> This is updated version of oom handling improvement om memcg.
> But all codes are totaly renewed. This may not be sophisiticated well but
> enough for showing idea.
>
> This patch does following things.
> * set "memcg is under OOM" if somone gets into OOM under a memcg.
> like zone's OOM lock, tree-of-memcg is marked as under OOM.
> By this. simlutabeous OOM kill in a tree will not happen.
>
> * When other threads try to reclaim memory or call oom-kill, it
> checks its own target memcg is under oom or not. If someone
> calls oom-killer already, the thread will be queued to waitq.
>
> * At some event which makes room for new memory, threads on waitq
> are waken up.
> ** A page (or chunk of pages) are unchraged.
> ** A task is moved.
> ** limit is enlarged.
>
> And this patch also allows to check "current's memcg is changed or not"
> while charging.
>
> Considering what admin/daemon can do when it notice OOM,
> * kill a process
> * move a process (to other cgroup which has free area)
> * remove a file (on tmpfs or some)
> * enlarge limit
> I think all chances for wakeing up waiters are covered by these.
>
> After this patch, memcg's accounting will not fail in usual path.
> If all tasks are OOM_DISABLE, memcg may hang. But admin can have
> several options described in above. So, oom notifier+freeze should be
> implemented.
>
> TODO: maybe not difficult.
> * Add oom notifier. (can reuse memory.threashold ?)
> * Add a switch for oom-freeze rather than oom-kill.
>
> Cc: David Rientjes <rientjes@xxxxxxxxxx>
> Cc: Balbir Singh <balbir@xxxxxxxxxx>
> Cc: Daisuke Nishimura <nishimura@xxxxxxxxxxxxxxxxx>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> ---
> include/linux/memcontrol.h | 6 -
> mm/memcontrol.c | 208 +++++++++++++++++++++++++++++++++++----------
> mm/oom_kill.c | 11 --
> 3 files changed, 167 insertions(+), 58 deletions(-)
>
> Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
> +++ mmotm-2.6.33-Feb11/mm/memcontrol.c
> @@ -200,7 +200,6 @@ struct mem_cgroup {
> * Should the accounting and control be hierarchical, per subtree?
> */
> bool use_hierarchy;
> - unsigned long last_oom_jiffies;
> atomic_t refcnt;
>
> unsigned int swappiness;
> @@ -223,6 +222,8 @@ struct mem_cgroup {
> */
> unsigned long move_charge_at_immigrate;
>
> + /* counting ongoing OOM requests under sub hierarchy */
> + atomic_t oom_count;
> /*
> * percpu counter.
> */
> @@ -1096,6 +1097,89 @@ done:
> }
>
> /*
> + * set/under memcg_oom counting is done under mutex.
> + */
> +static DEFINE_MUTEX(memcg_oom_mutex);
> +static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
> +
> +static int set_memcg_oom_cb(struct mem_cgroup *mem, void *data)
> +{
> + int *max_count = (int*)data;
> + int count = atomic_inc_return(&mem->oom_count);
> + if (*max_count < count)
> + *max_count = count;
> + return 0;
> +}
> +
> +static int unset_memcg_oom_cb(struct mem_cgroup *mem, void *data)
> +{
> + atomic_set(&mem->oom_count, 0);
> + return 0;
> +}
> +
> +static bool memcg_under_oom(struct mem_cgroup *mem)
> +{
> + if (atomic_read(&mem->oom_count))
> + return true;
> + return false;
> +}
> +
> +static void memcg_oom_wait(struct mem_cgroup *mem)
> +{
> + DEFINE_WAIT(wait);
> +
> + prepare_to_wait(&memcg_oom_waitq, &wait, TASK_INTERRUPTIBLE);
> + if (memcg_under_oom(mem))
> + schedule();
> + finish_wait(&memcg_oom_waitq, &wait);
> +}
> +
> +static void memcg_oom_wake(void)
> +{
> + /* This may wake up unnecessary tasks..but it's not big problem */
> + wake_up_all(&memcg_oom_waitq);
> +}
> +/*
> + * Check there are ongoing oom-kill in this hierarchy or not.
> + * If now under oom-kill, wait for some event to restart job.
> + */
> +static bool memcg_handle_oom(struct mem_cgroup *mem, gfp_t mask)
> +{
> + int oom_count = 0;
> + DEFINE_WAIT(wait);
> + /*
> + * Considering hierarchy (below)
> + * /A
> + * /01
> + * /02
> + * If 01 or 02 is under oom-kill, oom-kill in A should wait.
> + * If "A" is under oom-kill, oom-kill in 01 and 02 should wait.
> + * (task in 01/02 can be killed.)
> + * But if 01 is under oom-kill, 02's oom-kill is independent from it.
> + */
> + prepare_to_wait(&memcg_oom_waitq, &wait, TASK_INTERRUPTIBLE);
> + mem_cgroup_walk_tree(mem, &oom_count, set_memcg_oom_cb);
> + /* Am I the 1st oom killer in this sub hierarchy ? */
> + if (oom_count == 1) {
> + finish_wait(&memcg_oom_waitq, &wait);
> + mem_cgroup_out_of_memory(mem, mask);
> + mem_cgroup_walk_tree(mem, NULL, unset_memcg_oom_cb);
> + } else {
> + /*
> + * Wakeup is called when
> + * 1. pages are uncharged. (by killed, or removal of a file)
> + * 2. limit is enlarged.
> + * 3. a task is moved.
> + */
> + schedule();
> + finish_wait(&memcg_oom_waitq, &wait);
> + }
> + if (test_thread_flag(TIF_MEMDIE))
> + return false;
> + return true;
> +}
> +
> +/*
> * This function returns the number of memcg under hierarchy tree. Returns
> * 1(self count) if no children.
> */
> @@ -1234,34 +1318,6 @@ static int mem_cgroup_hierarchical_recla
> return total;
> }
>
> -bool mem_cgroup_oom_called(struct task_struct *task)
> -{
> - bool ret = false;
> - struct mem_cgroup *mem;
> - struct mm_struct *mm;
> -
> - rcu_read_lock();
> - mm = task->mm;
> - if (!mm)
> - mm = &init_mm;
> - mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> - if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
> - ret = true;
> - rcu_read_unlock();
> - return ret;
> -}
> -
> -static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
> -{
> - mem->last_oom_jiffies = jiffies;
> - return 0;
> -}
> -
> -static void record_last_oom(struct mem_cgroup *mem)
> -{
> - mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
> -}
> -
> /*
> * Currently used to update mapped file statistics, but the routine can be
> * generalized to update other statistics as well.
> @@ -1419,6 +1475,7 @@ static int __cpuinit memcg_stock_cpu_cal
> return NOTIFY_OK;
> }
>
> +
> /*
> * Unlike exported interface, "oom" parameter is added. if oom==true,
> * oom-killer can be invoked.
> @@ -1427,17 +1484,21 @@ static int __mem_cgroup_try_charge(struc
> gfp_t gfp_mask, struct mem_cgroup **memcg,
> bool oom, struct page *page)
> {
> - struct mem_cgroup *mem, *mem_over_limit;
> - int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> + struct mem_cgroup *mem, *mem_over_limit, *recorded;
> + int nr_retries, csize;
> struct res_counter *fail_res;
> - int csize = CHARGE_SIZE;
> +
> +start:
> + nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> + recorded = *memcg;
> + csize = CHARGE_SIZE;
> + mem = NULL;
>
> if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> /* Don't account this! */
> *memcg = NULL;
> return 0;
> }
> -
> /*
> * We always charge the cgroup the mm_struct belongs to.
> * The mm_struct's mem_cgroup changes on task migration if the
> @@ -1489,6 +1550,12 @@ static int __mem_cgroup_try_charge(struc
> }
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
> + /* already in OOM ? */
> + if (memcg_under_oom(mem_over_limit)) {
> + /* Don't add too much pressure to the host */
> + memcg_oom_wait(mem_over_limit);
> + goto retry;
> + }
>
> ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> gfp_mask, flags);
> @@ -1549,11 +1616,15 @@ static int __mem_cgroup_try_charge(struc
> }
>
> if (!nr_retries--) {
> - if (oom) {
> - mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
> - record_last_oom(mem_over_limit);
> - }
> - goto nomem;
> +
> + if (!oom)
> + goto nomem;
> + /* returnes false if current is killed */
> + if (memcg_handle_oom(mem_over_limit, gfp_mask))
> + goto retry;
> + /* For smooth oom-kill of current, return 0 */
> + css_put(&mem->css);
> + return 0;
> }
> }
> if (csize > PAGE_SIZE)
> @@ -1572,6 +1643,15 @@ done:
> nomem:
> css_put(&mem->css);
> return -ENOMEM;
> +
> +retry:
> + /*
> + * current's mem_cgroup can be moved while we're waiting for
> + * memory reclaim or OOM-Kill.
> + */
> + *memcg = recorded;
> + css_put(&mem->css);
> + goto start;
> }
>
> /*
> @@ -1589,6 +1669,9 @@ static void __mem_cgroup_cancel_charge(s
> VM_BUG_ON(test_bit(CSS_ROOT, &mem->css.flags));
> WARN_ON_ONCE(count > INT_MAX);
> __css_put(&mem->css, (int)count);
> +
> + if (memcg_under_oom(mem))
> + memcg_oom_wake();
> }
> /* we don't need css_put for root */
> }
> @@ -2061,6 +2144,10 @@ direct_uncharge:
> res_counter_uncharge(&mem->res, PAGE_SIZE);
> if (uncharge_memsw)
> res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> + /* Slow path to check OOM waiters */
> + if (!current->memcg_batch.do_batch || batch->memcg != mem)
> + if (memcg_under_oom(mem))
> + memcg_oom_wake();
> return;
> }
>
> @@ -2200,6 +2287,9 @@ void mem_cgroup_uncharge_end(void)
> res_counter_uncharge(&batch->memcg->res, batch->bytes);
> if (batch->memsw_bytes)
> res_counter_uncharge(&batch->memcg->memsw, batch->memsw_bytes);
> +
> + if (memcg_under_oom(batch->memcg))
> + memcg_oom_wake();
> /* forget this pointer (for sanity check) */
> batch->memcg = NULL;
> }
> @@ -2408,8 +2498,7 @@ void mem_cgroup_end_migration(struct mem
>
> /*
> * A call to try to shrink memory usage on charge failure at shmem's swapin.
> - * Calling hierarchical_reclaim is not enough because we should update
> - * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
> + * Calling hierarchical_reclaim is not enough because we have to hand oom-kill.
> * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
> * not from the memcg which this page would be charged to.
> * try_charge_swapin does all of these works properly.
> @@ -2440,7 +2529,8 @@ static int mem_cgroup_resize_limit(struc
> u64 memswlimit;
> int ret = 0;
> int children = mem_cgroup_count_children(memcg);
> - u64 curusage, oldusage;
> + u64 curusage, oldusage, curlimit;
> + int enlarge = 0;
>
> /*
> * For keeping hierarchical_reclaim simple, how long we should retry
> @@ -2451,6 +2541,7 @@ static int mem_cgroup_resize_limit(struc
>
> oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
>
> +
> while (retry_count) {
> if (signal_pending(current)) {
> ret = -EINTR;
> @@ -2468,6 +2559,9 @@ static int mem_cgroup_resize_limit(struc
> mutex_unlock(&set_limit_mutex);
> break;
> }
> + curlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> + if (curlimit < val)
> + enlarge = 1;
> ret = res_counter_set_limit(&memcg->res, val);
> if (!ret) {
> if (memswlimit == val)
> @@ -2477,8 +2571,20 @@ static int mem_cgroup_resize_limit(struc
> }
> mutex_unlock(&set_limit_mutex);
>
> - if (!ret)
> + if (!ret) {
> + /*
> + * If we enlarge limit of memcg under OOM,
> + * wake up waiters.
> + */
> + if (enlarge && memcg_under_oom(memcg))
> + memcg_oom_wake();
> + break;
> + }
> + /* Under OOM ? If so, don't add more pressure. */
> + if (memcg_under_oom(memcg)) {
> + ret = -EBUSY;
> break;
> + }
>
> mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> MEM_CGROUP_RECLAIM_SHRINK);
> @@ -2497,9 +2603,10 @@ static int mem_cgroup_resize_memsw_limit
> unsigned long long val)
> {
> int retry_count;
> - u64 memlimit, oldusage, curusage;
> + u64 memlimit, oldusage, curusage, curlimit;
> int children = mem_cgroup_count_children(memcg);
> int ret = -EBUSY;
> + int enlarge;
>
> /* see mem_cgroup_resize_res_limit */
> retry_count = children * MEM_CGROUP_RECLAIM_RETRIES;
> @@ -2521,6 +2628,9 @@ static int mem_cgroup_resize_memsw_limit
> mutex_unlock(&set_limit_mutex);
> break;
> }
> + curlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> + if (curlimit < val)
> + enlarge = 1;
> ret = res_counter_set_limit(&memcg->memsw, val);
> if (!ret) {
> if (memlimit == val)
> @@ -2530,8 +2640,15 @@ static int mem_cgroup_resize_memsw_limit
> }
> mutex_unlock(&set_limit_mutex);
>
> - if (!ret)
> + if (!ret) {
> + if (enlarge && memcg_under_oom(memcg))
> + memcg_oom_wake();
> break;
> + }
> + if (memcg_under_oom(memcg)) {
> + ret = -EBUSY;
> + continue;
> + }
>
> mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> MEM_CGROUP_RECLAIM_NOSWAP |
> @@ -3859,6 +3976,9 @@ one_by_one:
> ret = -EINTR;
> break;
> }
> + /* Undo precharges if there is ongoing OOM */
> + if (memcg_under_oom(mem))
> + return -ENOMEM;
> if (!batch_count--) {
> batch_count = PRECHARGE_COUNT_AT_ONCE;
> cond_resched();
> Index: mmotm-2.6.33-Feb11/mm/oom_kill.c
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/mm/oom_kill.c
> +++ mmotm-2.6.33-Feb11/mm/oom_kill.c
> @@ -487,6 +487,9 @@ retry:
> goto retry;
> out:
> read_unlock(&tasklist_lock);
> + /* give a chance to die for selected process */
> + if (!test_thread_flag(TIF_MEMDIE))
> + schedule_timeout_uninterruptible(1);
> }
> #endif
>
> @@ -601,13 +604,6 @@ void pagefault_out_of_memory(void)
> /* Got some memory back in the last second. */
> return;
>
> - /*
> - * If this is from memcg, oom-killer is already invoked.
> - * and not worth to go system-wide-oom.
> - */
> - if (mem_cgroup_oom_called(current))
> - goto rest_and_return;
> -
> if (sysctl_panic_on_oom)
> panic("out of memory from page fault. panic_on_oom is selected.\n");
>
> @@ -619,7 +615,6 @@ void pagefault_out_of_memory(void)
> * Give "p" a good chance of killing itself before we
> * retry to allocate memory.
> */
> -rest_and_return:
> if (!test_thread_flag(TIF_MEMDIE))
> schedule_timeout_uninterruptible(1);
> }
> Index: mmotm-2.6.33-Feb11/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.33-Feb11/include/linux/memcontrol.h
> @@ -124,7 +124,6 @@ static inline bool mem_cgroup_disabled(v
> return false;
> }
>
> -extern bool mem_cgroup_oom_called(struct task_struct *task);
> void mem_cgroup_update_file_mapped(struct page *page, int val);
> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> gfp_t gfp_mask, int nid,
> @@ -258,11 +257,6 @@ static inline bool mem_cgroup_disabled(v
> return true;
> }
>
> -static inline bool mem_cgroup_oom_called(struct task_struct *task)
> -{
> - return false;
> -}
> -
> static inline int
> mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
> {
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/