Re: [RFC] panic_on_oom_timeout
From: Michal Hocko
Date: Wed Jun 17 2015 - 08:37:08 EST
On Wed 17-06-15 21:16:37, Tetsuo Handa wrote:
> Michal Hocko wrote a few minutes ago:
> > Subject: [RFC -v2] panic_on_oom_timeout
>
> Oops, we raced...
>
> Michal Hocko wrote:
> > On Tue 16-06-15 22:14:28, Tetsuo Handa wrote:
[...]
> > > Since memcg OOM is less critical than system OOM, because the administrator
> > > still has a chance to perform steps to resolve the OOM state, we could give a
> > > longer timeout (e.g. 600 seconds) for memcg OOM while giving a shorter timeout
> > > (e.g. 10 seconds) for system OOM. But if (a) is impossible, trying to configure
> > > a different timeout for a non-system OOM stall makes no sense.
> >
> > I still do not see any point in having separate timeouts.
> >
> I think that the administrator cannot configure an adequate timeout if we
> don't allow separate timeouts.
Why? What prevents a user space policy if the system is still usable?
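For the memcg case, user space can already get notified of the OOM condition
(memory.oom_control + eventfd) and apply whatever policy it wants. A minimal
sketch of such a handler; the cgroup path and the recovery action are just
placeholders and error handling is omitted:

/* user space memcg OOM watchdog sketch (cgroup v1) */
#include <sys/eventfd.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int efd = eventfd(0, 0);
	int ofd = open("/sys/fs/cgroup/memory/mygroup/memory.oom_control",
		       O_RDONLY);
	int cfd = open("/sys/fs/cgroup/memory/mygroup/cgroup.event_control",
		       O_WRONLY);
	char buf[32];
	uint64_t cnt;

	/* register the eventfd for OOM notifications on this memcg */
	snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
	write(cfd, buf, strlen(buf));

	for (;;) {
		/* blocks until the memcg hits its limit and cannot reclaim */
		read(efd, &cnt, sizeof(cnt));
		/* policy decision: raise the limit, kill a task, alert, ... */
		fprintf(stderr, "memcg OOM detected, applying policy\n");
	}
	return 0;
}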
> > Again, panic_on_oom=2 sounds very dubious to me, as already mentioned. Life
> > would be so much easier if we simply started by supporting panic_on_oom=1 for
> > now. It would be a simple timer (as we cannot use DELAYED_WORK) which would
> > just panic the machine after a timeout. We
>
> My patch recommends that administrators stop setting panic_on_oom to a non-zero
> value and start setting separate timeouts, one for system OOM (a short timeout)
> and the other for non-system OOM (a long timeout).
>
> How does my patch involve panic_on_oom ?
You are panicking the system on an OOM condition. It feels natural to bind a
timeout-based policy to this knob.
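Roughly something like the following untested sketch is what I have in mind;
sysctl_panic_on_oom_timeout and the two hook functions are made-up names for
illustration only:

#include <linux/timer.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>

/* made-up knob, in seconds; 0 means disabled */
static unsigned int sysctl_panic_on_oom_timeout;

static void panic_on_oom_timeout_fn(unsigned long unused)
{
	panic("Out of memory: OOM victim did not exit within %u seconds",
	      sysctl_panic_on_oom_timeout);
}

static DEFINE_TIMER(panic_on_oom_timer, panic_on_oom_timeout_fn, 0, 0);

/* would be called right after a victim is marked TIF_MEMDIE */
static void start_panic_on_oom_timer(void)
{
	if (sysctl_panic_on_oom_timeout && !timer_pending(&panic_on_oom_timer))
		mod_timer(&panic_on_oom_timer,
			  jiffies + sysctl_panic_on_oom_timeout * HZ);
}

/* would be called once the last OOM victim has exited */
static void stop_panic_on_oom_timer(void)
{
	del_timer_sync(&panic_on_oom_timer);
}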
> My patch does not care about dubious panic_on_oom=2.
Yes, it replaces it with additional timeouts, which seems like overkill to me.
> > > > Besides that oom_unkillable_task doesn't sound like a good match to
> > > > evaluate this logic. I would expect it to be in oom_scan_process_thread.
> > >
> > > Well, select_bad_process(), which calls oom_scan_process_thread(), would
> > > break out of the loop when encountering the first TIF_MEMDIE task.
> > > We need to change
> > >
> > > 		case OOM_SCAN_ABORT:
> > > 			rcu_read_unlock();
> > > 			return (struct task_struct *)(-1UL);
> > >
> > > to defer returning (-1UL) when a TIF_MEMDIE thread is found, in order to make
> > > sure that all TIF_MEMDIE threads are examined for the timeout. With that
> > > change made,
> > >
> > > 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> > > 		/*** this location ***/
> > > 		if (!force_kill)
> > > 			return OOM_SCAN_ABORT;
> > > 	}
> > >
> > > in oom_scan_process_thread() will be an appropriate place for evaluating
> > > this logic.
> >
> > You can also keep select_bad_process untouched and simply check the
> > remaining TIF_MEMDIE tasks in oom_scan_process_thread (if the timeout is > 0,
> > of course, so most configurations will be unaffected).
>
> Most configurations will be unaffected because there is usually no TIF_MEMDIE
> thread. But if something went wrong and there were 100 TIF_MEMDIE threads out
> of 10000 threads, traversing the tasklist from oom_scan_process_thread()
> whenever a TIF_MEMDIE thread is found sounds wasteful to me. If we keep
> traversing from select_bad_process(), the number of threads to check remains
> 10000.
Yes, but the code would be uglier and duplicated for the memcg and global cases.
Also, this is an extremely slow path, so an optimization to skip scanning
some tasks is not worth making the code more obscure.
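To be concrete, I was thinking of something along these lines in
oom_scan_process_thread() (untested; oom_victim_expired() is a made-up helper
standing in for whatever timestamp comparison the patch ends up using):

	/* evaluate the timeout where we already look at TIF_MEMDIE,
	 * rather than in oom_unkillable_task() */
	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
		if (oom_victim_expired(task))	/* made-up helper */
			panic("Out of memory: OOM victim stuck past the timeout");
		if (!force_kill)
			return OOM_SCAN_ABORT;
	}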
[...]
> @@ -1583,11 +1584,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  			case OOM_SCAN_CONTINUE:
>  				continue;
>  			case OOM_SCAN_ABORT:
> -				css_task_iter_end(&it);
> -				mem_cgroup_iter_break(memcg, iter);
> -				if (chosen)
> -					put_task_struct(chosen);
> -				goto unlock;
> +				memdie_pending = true;
> +				continue;
>  			case OOM_SCAN_OK:
>  				break;
>  			};
OOM_SCAN_ABORT can be returned even for a !TIF_MEMDIE task, so you might
force a victim selection when there is an exiting task and we could otherwise
delay the actual killing.
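For reference, the path I mean is the exiting-task check near the end of
oom_scan_process_thread(), which aborts the scan without the task ever having
TIF_MEMDIE set (quoting roughly from the current tree):

	/* an exiting task is about to free its memory, so back off and
	 * give it some time instead of selecting a new victim */
	if (task_will_free_mem(task) && !force_kill)
		return OOM_SCAN_ABORT;

	return OOM_SCAN_OK;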
--
Michal Hocko
SUSE Labs