Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

From: Michal Hocko
Date: Thu Oct 05 2017 - 10:10:27 EST

On Thu 05-10-17 14:41:13, Roman Gushchin wrote:
> On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> > On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
> > [...]
> > > That will silently ignore what the user writes to the memory.oom_group
> > > control files across the system's cgroup tree.
> > >
> > > We'll have a knob that lets the workload declare itself an indivisible
> > > memory consumer, that it would like to get killed in one piece, and
> > > it's silently ignored because of a mount option they forgot to pass.
> > >
> > > That's not good from an interface perspective.
> >
> > Yes and that is why I think a boot time knob would be the most simple
> > way. It will also open doors for more oom policies in future which I
> > believe come sooner or later.
> So, we would rely on grub config to set up OOM policy? Sounds weird.
> We use boot options, when it's hard to implement on the fly switching
> (like turning on/off socket memory accounting), but here is not this case.

Well we define global policies with kernel command line so I do not
think it would be something unusual. An advantage is that you do not
have deal with semantic of the policy change during the runtime which is
something I am not sure we need or even want.

> > > On the other hand, the only benefit of this patch is to shield users
> > > from changes to the OOM killing heuristics. Yet, it's really hard to
> > > imagine that modifying the victim selection process slightly could be
> > > called a regression in any way. We have done that many times over,
> > > without a second thought on backwards compatibility:
> > >
> > > 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> > > a63d83f427fb oom: badness heuristic rewrite
> > > 778c14affaf9 mm, oom: base root bonus on current usage
> >
> > yes we have changed that without a deeper considerations. Some of those
> > changes are arguable (e.g. child scarification). The oom badness
> > heuristic rewrite has triggered quite some complains AFAIR (I remember
> > Kosaki has made several attempts to revert it). I think that we are
> > trying to be more careful about user visible changes than we used to be.
> >
> > More importantly I do not think that the current (non-memcg aware) OOM
> > policy is somehow obsolete and many people expect it to behave
> > consistently. As I've said already, I have seen many complains that the
> > OOM killer doesn't kill the right task. Most of them were just NUMA
> > related issues where the oom report was not clear enough. I do not want
> > to repeat that again now. Memcg awareness is certainly a useful
> > heuristic but I do not see it universally applicable to all workloads.
> >
> > > Let's not make the userspace interface crap because of some misguided
> > > idea that the OOM heuristic is a hard promise to userspace. It's never
> > > been, and nobody has complained about changes in the past.
> > >
> > > This case is doubly silly, as the behavior change only applies to
> > > cgroup2, which doesn't exactly have a large base of legacy users yet.
> >
> > I agree on the interface part but I disagree with making it default just
> > because v2 is not largerly adopted yet.
> I believe that the only real regression can be caused by active using of
> oom_score_adj. I really don't know how many cgroup v2 users are relying
> on it (hopefully, 0).

Not only. A memcg with many small tasks could regress as well.

> So, personally I would prefer to have an opt-out cgroup v2 mount option
> (sane new behavior for most users, 100% backward compatibility for rare
> strange setups), but I don't have a very strong opinion here.

I fail to see why should people disable the feature after they see an
unexpected behavior rather than other way around when the feature is
enabled when it is really wanted. The opt-in is more correct just from
the "least surprise POV".
Michal Hocko