Re: [v8 0/4] cgroup-aware OOM killer

From: Johannes Weiner
Date: Thu Sep 21 2017 - 10:21:28 EST


On Mon, Sep 11, 2017 at 01:44:39PM -0700, David Rientjes wrote:
> On Mon, 11 Sep 2017, Roman Gushchin wrote:
>
> > This patchset makes the OOM killer cgroup-aware.
> >
> > v8:
> > - Do not kill tasks with OOM_SCORE_ADJ -1000
> > - Make the whole thing opt-in with cgroup mount option control
> > - Drop oom_priority for further discussions
>
> Nack, we specifically require oom_priority for this to function correctly,
> otherwise we cannot prefer to kill from low priority leaf memcgs as
> required. v8 appears to implement new functionality that we want, to
> compare two memcgs based on usage, but without the ability to influence
> that decision to protect important userspace, so now I'm in a position
> where (1) nothing has changed if I don't use the new mount option or (2) I
> get completely different oom kill selection with the new mount option but
> not the ability to influence it. I was much happier with the direction
> that v7 was taking, but since v8 causes us to regress without the ability
> to change memcg priority, this has to be nacked.

That's a ridiculous nak.

The fact that this patch series doesn't solve your particular problem
is not a technical argument to *reject* somebody else's work to solve
a different problem. It's not a regression when behavior is completely
unchanged unless you explicitly opt into a new functionality.

So let's stay reasonable here.

The patch series has merit as it currently stands. It makes OOM
killing in a cgrouped system fairer and less surprising. Whether you
have the ability to influence this in a new way is an entirely
separate discussion. It's one that involves ABI and user guarantees.

Right now Roman's patches make no guarantees on how the cgroup tree is
descended. But once we define an interface for prioritization, it
locks the victim algorithm into place to a certain extent.

It also involves a discussion about how much control userspace should
have over OOM killing in the first place. It's a last-minute effort to
save the kernel from deadlocking on memory. Whether that is the time
and place to have userspace make clever resource management decisions
is an entirely different thing than what Roman is doing.

But this patch series doesn't prevent any such future discussion and
implementations, and it's not useless without it. So let's not
conflate these two things, and hold the priority patch for now.

Thanks.