Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

From: David Rientjes
Date: Tue Jan 21 2014 - 01:13:46 EST


On Thu, 16 Jan 2014, Michal Hocko wrote:

> > The heuristic may have existed for ages, but the proposed memcg
> > configuration for preserving memory such that userspace oom handlers may
> > run such as
> >
> >        _____root______
> >       /               \
> >     user              oom
> >     /  \              /  \
> >    A    B            a    b
> >
> > where user/memory.limit_in_bytes == [amount of present RAM] +
> > oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be
> > problematic, including Johannes' buggy bypass for charges in memcgs with
> > pending memcgs that has since been fixed after I identified it. This
> > bypass is included. Processes attached to "a" and "b" are userspace oom
> > handlers for processes attached to "A" and "B", respectively.
> >
> > The amount of memory you're talking about is proportional to the number of
> > processes that have pending SIGKILLs (and now those with PF_EXITING set),
> > the former of which is obviously more concerning since they could be
> > charging memory at any point in the kernel that would succeed.
>
> I understand your concerns. Yes, excessive charges might be dangerous. I
> haven't dismissed that when you mentioned it earlier. I am just
> repeatedly asking how much memory we are talking about, how real the
> issue is, and what all the other consequences are. And for some reason
> you are not providing that information (or maybe I am just not seeing
> it in your responses), and that is why we are stuck in a circle.
>

Wtf are you talking about? You're adding a bypass in this patch and then
you're asking me to go and see how much memory it could potentially bypass
and take away from oom handlers under the above memcg configuration? This
seems like something you should provide before throwing out patches that
nobody has tested if you want to make the argument that the above memcg
configuration is valid for handling userspace oom notifications.

And you certainly have dismissed what I've mentioned earlier when I said
that anybody can add memory allocation to the exit path later on and
nobody is going to think about how much memory this is going to bypass to
the root memcg and potentially take away from userspace oom handlers.

There are two possible ways forward:

 - avoid bypass to the root memcg in every possible case such that the
   above memcg configuration actually makes a guarantee to userspace oom
   handlers attached to it, or

 - provide per-memcg memory reserves such that userspace oom handlers can
   allocate and charge memory without the above memcg configuration so
   there is a guarantee.

What's not acceptable, now or ever, is suggesting a solution to a problem
that is supposed to guarantee some resource and then allow under some
circumstances that resource to be completely depleted such that the
solution never works.

> Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that
> should terminate in a short time. I think that having a setup with a
> guarantee of never triggering the global OOM is too ambitious and I am
> even skeptical it would be achievable.
>

"Short time" is meaningless if the memory allocation causes memory to not
be available to userspace oom handlers. If allocations are allowed to be
charged because you're in the exit() path or because you have SIGKILL,
that can result in a system oom condition that would prevent userspace
from being able to handle them.
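
To make the concern concrete, here is a toy userspace model, not kernel
code. Everything in it is an assumption made for illustration: the Memcg
class, the charge() helper, the unit sizes, and the choice to let the root
group's limit stand in for present RAM. The limits are picked so that the
"user" subtree alone cannot consume all of RAM. It only shows that charges
bypassed to root are not checked against any limit and can eat the memory
the handler in "a" was counting on.

```python
class Memcg:
    """Toy memory cgroup: a limit, a usage counter, and a parent."""

    def __init__(self, name, limit, parent=None):
        self.name = name
        self.limit = limit
        self.usage = 0
        self.parent = parent

    def try_charge(self, units):
        """Charge this group and all ancestors, or charge nothing and fail."""
        chain = []
        node = self
        while node is not None:
            if node.usage + units > node.limit:
                return False
            chain.append(node)
            node = node.parent
        for node in chain:
            node.usage += units
        return True


def charge(memcg, root, units, dying):
    """Hypothetical charge path: dying tasks bypass every limit and are
    accounted to root only (assumption modelling the bypass under debate)."""
    if dying:
        root.usage += units  # no limit check at all
        return True
    return memcg.try_charge(units)


def handler_can_allocate(bypassed_units):
    """Return whether the oom handler in "a" can still charge 50 units
    after tasks in "A" fill their limit and then bypass more to root."""
    ram = 1000                        # made-up amount of present RAM
    root = Memcg("root", ram)         # root limit stands in for RAM
    user = Memcg("user", 900, root)
    oom = Memcg("oom", 100, root)
    grp_A = Memcg("A", 900, user)
    grp_a = Memcg("a", 100, oom)

    grp_A.try_charge(900)             # "A" runs up to the user limit
    charge(grp_A, root, bypassed_units, dying=True)
    return charge(grp_a, root, 50, dying=False)
```

In this model the handler's own limit is never the problem. It fails only
because the bypassed charges have already consumed what was left over for
the "oom" subtree, no matter how short-lived the dying tasks are.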

> > I'm debating both fatal_signal_pending() and PF_EXITING here since they
> > are now both bypasses; we need to remove fatal_signal_pending(). My
> > simple question with your patch: how do you guarantee memory to processes
> > attached to "a" and "b"?
>
> The only way you can get that _guarantee_ is to account all the memory
> allocations. And that is not implemented and I would even question
> whether it is worthwhile. So we still have to live with a possibility
> of triggering the global OOM killer. That's why I believe we need to be
> able to tell the kernel what is the user policy for oom killer (that is
> a different discussion though).
>

So you're saying that Tejun's suggested userspace oom handler
configuration is pointless, correct? We can certainly provide a guarantee
if memory is reserved specifically for userspace oom handling like I
proposed, the same way that memory reserves are guaranteed for oom killed
processes.
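
As a rough sketch of what a per-memcg reserve means, again a toy model and
not the proposed kernel interface: ReservedMemcg, oom_reserve, and the
is_oom_handler flag are hypothetical names invented for this example.
Ordinary charges stop at the limit; only a task marked as the oom handler
may charge into the reserve on top of it.

```python
class ReservedMemcg:
    """Toy memcg with a reserve only oom handlers may use (assumption)."""

    def __init__(self, limit, oom_reserve):
        self.limit = limit
        self.oom_reserve = oom_reserve
        self.usage = 0

    def try_charge(self, units, is_oom_handler=False):
        """Charge up to the limit, or up to limit + reserve for a handler."""
        ceiling = self.limit
        if is_oom_handler:
            ceiling += self.oom_reserve
        if self.usage + units > ceiling:
            return False
        self.usage += units
        return True


group = ReservedMemcg(limit=100, oom_reserve=10)
filled = group.try_charge(100)                          # workload hits the limit
normal_after = group.try_charge(1)                      # ordinary task is refused
handler_ok = group.try_charge(10, is_oom_handler=True)  # handler uses the reserve
handler_over = group.try_charge(1, is_oom_handler=True) # reserve is exhausted
```

The guarantee in this sketch holds only if limit plus reserve is actually
backed by memory, which is the same condition that applies to the reserves
for oom killed processes.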