Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves

From: Michal Hocko
Date: Mon Dec 02 2013 - 15:02:36 EST

On Wed 27-11-13 11:34:36, Johannes Weiner wrote:
> On Tue, Nov 26, 2013 at 04:53:47PM -0800, David Rientjes wrote:
> > On Fri, 22 Nov 2013, Johannes Weiner wrote:
> >
> > > But userspace in all likeliness DOES need to take action.
> > >
> > > Reclaim is a really long process. If 5 times doing 12 priority cycles
> > > and scanning thousands of pages is not enough to reclaim a single
> > > page, what does that say about the health of the memcg?
> > >
> > > But more importantly, OOM handling is just inherently racy. A task
> > > might receive the kill signal a split second *after* userspace was
> > > notified. Or a task may exit voluntarily a split second after a
> > > victim was chosen and killed.
> > >
> >
> > That's not true even today without the userspace oom handling proposal
> > currently being discussed if you have a memcg oom handler attached to a
> > parent memcg with access to more memory than an oom child memcg. The oom
> > handler can disable the child memcg's oom killer with memory.oom_control
> > and implement its own policy to deal with any notification of oom.
>
> I was never implying the kernel handler. All the races exist with
> userspace handling as well.
>
> > This patch is required to ensure that in such a scenario that the oom
> > handler sitting in the parent memcg only wakes up when it's required to
> > intervene.
>
> A task could receive an unrelated kill between the OOM notification
> and going to sleep to wait for userspace OOM handling. Or another
> task could exit voluntarily between the notification and waitqueue
> entry, which would again be short-cut by the oom_recover of the exit
> uncharges.
>
> oom:                            other tasks:
> check signal/exiting
>                                 could exit or get killed here
> mem_cgroup_oom_trylock()
>                                 could exit or get killed here
> mem_cgroup_oom_notify()
>                                 could exit or get killed here
> if (userspace_handler)
>   sleep()                       could exit or get killed here
> else
>   oom_kill()
>                                 could exit or get killed here
>
> It does not matter where your signal/exiting check is, OOM
> notification can never be race free because OOM is just an arbitrary
> line we draw. We have no idea what all the tasks are up to and how
> close they are to releasing memory. Even if we freeze the whole group
> to handle tasks, it does not change the fact that the userspace OOM
> handler might kill one task and after the unfreeze another task
> immediately exits voluntarily or got a kill signal a split second
> after it was frozen.
>
> You can't fix this. We just have to draw the line somewhere and
> accept that in rare situations the OOM kill was unnecessary.

But we are not talking just about races here. What if the OOM is a
result of an OOM action itself? E.g. a killed task faults memory in
while exiting and it hasn't freed its memory yet. Should we notify in
such a case? What would a userspace OOM handler do (the in-kernel
implementation has an advantage because it can check the task's flags)?

> So again, I don't see this patch is doing anything but blur the
> current line and make notification less predictable. And, as someone
> else in this thread already said, it's a uservisible change in
> behavior and would break known tuning usecases.

I would like to understand how such a tuning usecase would work and how
it would break with this change.

Consider the above example. You would get 2 notifications for the very
same OOM condition.
On the other hand, if the encountered exiting task was just a race, then
we basically have two options. Either there are more tasks racing (and
not all of them are exiting) or there is only one (all are exiting).
We will not lose any notification in the first case because the flags
are checked before mem_cgroup_oom_trylock and so one of the tasks would
lock and notify.
The second case is more interesting. Userspace won't get a notification,
but we also know that no action is required as the OOM will resolve
itself. And now we should consider whether a notification would do more
good than harm. The tuning usecase would lose one event. Would such a
rare situation skew the statistics so much? On the other hand, a real OOM
killer would do something, which means something will be killed. I find
the latter much worse.
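
For illustration only, the two cases above can be modeled in a small
userspace C sketch. This is not kernel code: struct task_model,
struct oom_result and run_oom_path() are hypothetical names, and the
boolean flag merely stands in for the signal/exiting check that is done
before mem_cgroup_oom_trylock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one task entering the memcg OOM path. */
struct task_model {
    bool exiting_or_killed; /* stands in for the signal/exiting check */
};

/* What the modeled OOM path did for a group of racing tasks. */
struct oom_result {
    int notifications; /* times a userspace listener would be woken */
    int bypassed;      /* tasks that skipped OOM handling entirely */
};

/*
 * Each task checks its own flags first. Only a task that is not
 * already on its way out takes the (modeled) OOM lock and notifies,
 * and only the first such task does so.
 */
static struct oom_result run_oom_path(const struct task_model *tasks, size_t n)
{
    struct oom_result res = {0, 0};
    bool oom_locked = false;

    for (size_t i = 0; i < n; i++) {
        if (tasks[i].exiting_or_killed) {
            res.bypassed++; /* no lock, no notification */
            continue;
        }
        if (!oom_locked) {
            oom_locked = true;   /* models mem_cgroup_oom_trylock() */
            res.notifications++; /* models mem_cgroup_oom_notify() */
        }
    }
    return res;
}
```

With a mix of exiting and non-exiting tasks the model still produces one
notification; with only exiting tasks it produces none, which is the
single lost event discussed above.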

So all in all, I do agree with you that this path will never be race
free or free of pointless OOM actions. I also agree that drawing the
line is hard. But I am more inclined to prevent notification when
we already know that _no action_ is required, because IMHO the vast
majority of oom listeners are there to _do_ an action which is mostly
deadly.

Finally, if this is too controversial, then I would at least like to see
the same check introduced before we go to sleep in the oom_kill_disable
case, because that is a real bug.
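
A minimal sketch of that decision, again as a userspace C model rather
than kernel code. The enum and choose_oom_action() are hypothetical
names; the only point is that the exiting/killed check comes before the
task is put to sleep waiting for a userspace handler.

```c
#include <stdbool.h>

/* Hypothetical outcomes for a task that hit the memcg limit. */
enum oom_action {
    OOM_BYPASS,             /* task is dying anyway, let it proceed */
    OOM_WAIT_FOR_USERSPACE, /* oom_kill_disable set: sleep for the handler */
    OOM_KERNEL_KILL,        /* in-kernel OOM killer picks a victim */
};

/*
 * The check for a dying task is done first, so such a task never
 * sleeps on the OOM waitqueue when oom_kill_disable is set.
 */
static enum oom_action choose_oom_action(bool exiting_or_killed,
                                         bool oom_kill_disable)
{
    if (exiting_or_killed)
        return OOM_BYPASS;
    return oom_kill_disable ? OOM_WAIT_FOR_USERSPACE : OOM_KERNEL_KILL;
}
```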

Thanks!
--
Michal Hocko
SUSE Labs