Re: [patch 1/2] mm, memcg: avoid oom notification when current needsaccess to memory reserves
From: Johannes Weiner
Date: Tue Dec 03 2013 - 15:17:33 EST
On Tue, Dec 03, 2013 at 01:04:54PM +0100, Michal Hocko wrote:
> On Mon 02-12-13 16:25:00, Johannes Weiner wrote:
> > On Mon, Dec 02, 2013 at 09:02:21PM +0100, Michal Hocko wrote:
> [...]
> > > But we are not talking just about races here. What if the OOM is a
> > > result of an OOM action itself. E.g. a killed task faults a memory in
> > > while exiting and it hasn't freed its memory yet. Should we notify in
> > > such a case? What would an userspace OOM handler do (the in-kernel
> > > implementation has an advantage because it can check the tasks flags)?
> >
> > We don't notify in such a case. Every charge from a TIF_MEMDIE or
> > exiting task is bypassing the limit immediately. Not even reclaim.
>
> Not really. Assume a memcg is under OOM. A task is killed by
> userspace so we get into signal delivery code which clears
> fatal_signal_pending and the code goes on to exit but then it faults in.
> __mem_cgroup_try_charge will not see signal pending and TIF_MEMDIE is
> not set yet. OOM is still not resolved so we are back to square one.
Ah, that's a completely separate problem, though. One issue I have
with these checks is that I never know which one are shortcuts and
micro optimizations and which one are functionally necessary.
> > > > So again, I don't see this patch is doing anything but blur the
> > > > current line and make notification less predictable. And, as someone
> > > > else in this thread already said, it's a uservisible change in
> > > > behavior and would break known tuning usecases.
> > >
> > > I would like to understand how would such a tuning usecase work and how
> > > it would break with this change.
> >
> > I would do test runs and with every run increase the size of the
> > workload until I get OOM notifications to know when the kernel has
> > been pushed beyond its limits and available memory + reclaim
> > capability can't keep up with the workload anymore.
> >
> > Not informing me just because due to timing variance a random process
> > exits in the last moment would be flat out lying. The machine is OOM.
> > Many reclaim cycles failing is a good predictor. Last minute exit of
> > random task is not, it's happenstance and I don't want to rely on a
> > fluke like this to size my workload.
>
> Such a metric would be inherently racy for the same reason. You simply
> cannot rely on not seeing OOMs because an exiting task managed to leave
> in time (after MEM_CGROUP_RECLAIM_RETRIES direct reclaim loops and
> before mem_cgroup_oom). Difference between in time and little bit too
> late is just too fragile to be useful IMO.
Are we saying the same thing? Or did I misunderstand you?
> > > Consider the above example. You would get 2 notification for the very
> > > same OOM condition.
> > > On the other hand if the encountered exiting task was just a race then
> > > we have two options basically. Either there are more tasks racing (and
> > > not all of them are exiting) or there is only one (all are exiting).
> > > We will not loose any notification in the first case because the flags
> > > are checked before mem_cgroup_oom_trylock and so one of tasks would lock
> > > and notify.
> > > The second case is more interesting. Userspace won't get notification
> > > but we also know that no action is required as the OOM will be resolved
> > > by itself. And now we should consider whether notification would do more
> > > good than harm. The tuning usecase would loose one event. Would such a
> > > rare situation skew the statistics so much? On the other hand a real OOM
> > > killer would do something which means something will be killed. I find
> > > the later much worse.
> >
> > We already check in various places (sigh) for whether reclaim and
> > killing is still necessary. What is the end game here? An endless
> > loop right before the kill where we check if the kill is still
> > necessary?
>
> The patch as is doesn't cover all the cases and ideally we should check
> that for OOM_SCAN_ABORT and later in oom_kill_process because they can
> back out as well if we want to have only-on-action notification. Such a
> solution would be too messy though.
There is never an only-on-action notification and there is no check to
cover all cases. This is the fallacy I'm trying to point out. All we
can do is pick a fairly predictable line and stick to it.
> But as I've said. The primary reason I liked this change is because it
> solves the above mentioned OOM during exit issue and it also prevents
> from a pointless notification. I am perfectly fine with moving the
> check+set TIF_MEMDIE down so solve only the issue #1 and do not mess
> with notifications.
The notification is not pointless, we are OOM at the time we make the
decision.
> > You're not fixing this problem, so why make the notifications less
> > reliable?
>
> I am still not seeing why it is less reliable. The notification is
> inherently racy so you cannot rely on any simple metrics based on their
> count (at least not in general).
It should be based on reclaim failing, which is more predictable and
reproducible across multiple runs. Anything else is just random
coincidence, I can't fathom why you think that it's at all meaningful.
You could do one more reclaim cycle and a charge attempt, or even just
add an msleep() between the last reclaim cycle and the last charge
attempt for exactly the same outcome.
> > > So all in all. I do agree with you that this path will never be race
> > > free and without pointless OOM actions. I also agree that drawing the
> > > line is hard. But I am more inclined to prevent from notification when
> > > we already know that _no action_ is required because IMHO the vast
> > > majority of oom listeners are there to _do_ an action which is mostly
> > > deadly.
> >
> > If you want to push the machine so hard that active measures like
> > reclaim can't keep up and you rely on stupid timing like this to save
> > your sorry butt, then you'll just have to live with the
> > unpredictability of it. You're going to eat kills that might have
> > been avoided last minute either way. It's no excuse to plaster the MM
> > with TIF_MEMDIE checks and last-minute cgroup margin checks in the
> > weirdest locations.
>
> Yes I do not agree with putting TIF_MEMDIE checks all over the place and
> we should reduce their number to minimum. It is fair to say that the
> patch didn't add a new check. It just has moved it to cover both
> in-kernel and user space oom paths. That was a bonus I liked. To be
> honest I do not see the notification side effect as a big deal as those
> are racy anyway and I would rather see fewer of them than more
> (especially when it is clear that nothing is to be done).
I don't know why we have this check there in the first place, it
should be part of the OOM victim selection process and not a hacky
shortcut in memcg. It's a blatant layering violation for no obvious
reason.
> > Again, how likely is it anyway that the kill was truly skipped and not
> > just deferred? Reclaim failing is a good indicator that you're in
> > trouble, a random task exiting in an ongoing workload does not say
> > much. The machine could still be in trouble, so you just deferred the
> > inevitable, you didn't really avoid a kill.
> >
> > At this point we are talking about OOM kill frequency and statistical
> > probability during apparently normal operations. The OOM killer was
> > never written for that, it was supposed to be a last minute resort
> > that should not occur during normal operations and only if all SANE
> > measures to avoid it have failed. 99% of all users have no interest
> > in these micro-optimizations and we shouldn't clutter the code and
> > have unpredictable behavior without even a trace of data to show that
> > this is anything more than a placebo measure for one use case.
>
> OK, as it seems that the notification part is too controversial, how
> would you like the following? It reverts the notification part and still
> solves the fault on exit path. I will prepare the full patch with the
> changelog if this looks reasonable:
> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 28c9221b74ea..f44fe7e65a98 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1783,6 +1783,16 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> unsigned int points = 0;
> struct task_struct *chosen = NULL;
>
> + /*
> + * If current has a pending SIGKILL or is exiting, then automatically
> + * select it. The goal is to allow it to allocate so that it may
> + * quickly exit and free its memory.
> + */
> + if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> + set_thread_flag(TIF_MEMDIE);
> + goto cleanup;
return;
Anyway, I wish this would not be here at all and part of the thread
scanner. As David said, this is a very cold path, why the shortcut?
> @@ -2233,16 +2243,6 @@ bool mem_cgroup_oom_synchronize(bool handle)
> if (!handle)
> goto cleanup;
>
> - /*
> - * If current has a pending SIGKILL or is exiting, then automatically
> - * select it. The goal is to allow it to allocate so that it may
> - * quickly exit and free its memory.
> - */
> - if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> - set_thread_flag(TIF_MEMDIE);
> - goto cleanup;
> - }
> -
> owait.memcg = memcg;
> owait.wait.flags = 0;
> owait.wait.func = memcg_oom_wake_function;
> @@ -2266,6 +2266,13 @@ bool mem_cgroup_oom_synchronize(bool handle)
> schedule();
> mem_cgroup_unmark_under_oom(memcg);
> finish_wait(&memcg_oom_waitq, &owait.wait);
> +
> + /* Userspace OOM handler cannot set TIF_MEMDIE to a target */
> + if (memcg->oom_kill_disable) {
> + if ((fatal_signal_pending(current) ||
> + current->flags & PF_EXITING))
> + set_thread_flag(TIF_MEMDIE);
> + }
This is an entirely different change that I think makes sense.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/