Re: [RFC 1/3] memcg: notify userspace about OOM only when an action is due

From: Johannes Weiner
Date: Wed Jan 15 2014 - 15:31:04 EST


On Wed, Jan 15, 2014 at 08:00:15PM +0100, Michal Hocko wrote:
> On Wed 15-01-14 12:56:55, Johannes Weiner wrote:
> > On Wed, Jan 15, 2014 at 04:01:06PM +0100, Michal Hocko wrote:
> > > Userspace is currently notified about the OOM condition after reclaim
> > > fails to uncharge any memory within MEM_CGROUP_RECLAIM_RETRIES rounds.
> > > This usually means that the memcg is really in trouble and an
> > > OOM action (either done by userspace or the kernel) has to be taken.
> > > The kernel OOM killer, however, bails out and doesn't kill anything
> > > if it sees an already dying/exiting task, in the hope that memory
> > > will be released and the OOM situation will be resolved.
> > >
> > > Therefore it makes sense to notify userspace only after all measures
> > > have really been taken and either a userspace action is required or
> > > the kernel kills a task.
> > >
> > > This patch is based on an idea by David Rientjes to not notify
> > > userspace when the current task is killed or late in exiting.
> > > The original patch, however, didn't handle in-kernel OOM killer
> > > back-offs, which is implemented by this patch.
> > >
> > > Signed-off-by: Michal Hocko <mhocko@xxxxxxx>
> >
> > OOM is a temporary state because any task can exit at a time that is
> > not under our control and outside our knowledge. That's why the OOM
> > situation is defined by failing an allocation after a certain number
> > of reclaim and charge attempts.
> >
> > As of right now, the OOM sampling window is MEM_CGROUP_RECLAIM_RETRIES
> > loops of charge attempts and reclaim. If a racing task is exiting and
> > releasing memory during that window, the charge will succeed fine. If
> > the sampling window is too short in practice, it will have to be
> > extended, preferably by increasing MEM_CGROUP_RECLAIM_RETRIES.
>
> The patch doesn't try to address the above race because that one is
> unfixable. I hope that is clear.
>
> It just tries to reduce the burden on userspace OOM notification
> consumers and give them simple semantics. A notification comes only if
> an action will be necessary (either the kernel kills something or
> userspace is expected to act).

I.e. turn the OOM notification into an OOM kill event notification.

> E.g. consider a handler which tries to clean up after the kernel has
> handled the OOM and killed something. If the kernel could back off and
> refrain from killing anything after the notification has already fired,
> then userspace has no practical way to detect that (except for searching
> the kernel log for OOM messages, which might get suppressed due to rate
> limiting etc. Nothing I would call optimal).
> Or do you think that such a use case doesn't make much sense and it is
> an abuse of the notification interface?

I'm not sure what such a cleanup would be doing; a real-life use case
would be useful when we are about to change notification semantics.
I've heard "taking down the remaining tasks of the job" before, but
that would be better solved by having the OOM killer operate on
cgroups as single entities instead of taking out individual tasks.
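
For reference, the consumer side in question is just an eventfd bound to
memory.oom_control through cgroup.event_control.  A minimal sketch of such
a handler, assuming a cgroup v1 memory hierarchy mounted under
/sys/fs/cgroup/memory and an illustrative group name, with error handling
trimmed:

/* Register for memcg OOM notifications and wait for one event. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/eventfd.h>

int main(void)
{
	const char *grp = "/sys/fs/cgroup/memory/mygroup";	/* illustrative */
	char path[256], line[64];
	uint64_t events;

	int efd = eventfd(0, 0);

	snprintf(path, sizeof(path), "%s/memory.oom_control", grp);
	int ofd = open(path, O_RDONLY);

	/* registration format: "<eventfd> <fd of memory.oom_control>" */
	snprintf(path, sizeof(path), "%s/cgroup.event_control", grp);
	int cfd = open(path, O_WRONLY);
	snprintf(line, sizeof(line), "%d %d", efd, ofd);
	write(cfd, line, strlen(line));

	/* blocks until the kernel signals an OOM event for the group */
	read(efd, &events, sizeof(events));
	printf("memcg OOM event(s): %llu\n", (unsigned long long)events);

	/* whatever cleanup/restart policy the handler implements goes here */
	return 0;
}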

On the other hand, I can see how people use the OOM notification to
monitor system/cgroup health. David argued that vmpressure "critical"
would be the same thing. But first of all, this is not an argument to
change semantics of an established interface. And secondly, it only
tells you that reclaim is struggling; it doesn't give you the point of
failure (the OOM condition), regardless of what the docs claim.
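
For comparison, vmpressure hooks into the same eventfd mechanism, only
bound to memory.pressure_level with a level argument; a rough sketch,
using the same illustrative cgroup v1 paths as above:

/* Register for vmpressure "critical" events and wait for one. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/eventfd.h>

int main(void)
{
	char line[64];
	uint64_t events;
	int efd = eventfd(0, 0);
	int pfd = open("/sys/fs/cgroup/memory/mygroup/memory.pressure_level",
		       O_RDONLY);
	int cfd = open("/sys/fs/cgroup/memory/mygroup/cgroup.event_control",
		       O_WRONLY);

	/* registration format: "<eventfd> <fd of memory.pressure_level> <level>" */
	snprintf(line, sizeof(line), "%d %d critical", efd, pfd);
	write(cfd, line, strlen(line));

	/* fires when reclaim is struggling, not when a charge actually failed */
	read(efd, &events, sizeof(events));
	printf("vmpressure critical event(s): %llu\n", (unsigned long long)events);
	return 0;
}

The read side is identical; the difference is only in what the event
means.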

So, please, if you need a new interface, make a clear case for it and
then we can discuss if it's the right way to go. We do the same for
every other user interface, whether it's a syscall, an ioctl, a procfs
file etc. Just taking something existing that is close enough and
skewing the semantics in your favor like this is not okay.