Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

From: David Rientjes
Date: Thu Oct 05 2017 - 17:53:10 EST


On Thu, 5 Oct 2017, Johannes Weiner wrote:

> > It is, because it can quite clearly be a DoS and was prevented with
> > Roman's earlier design of iterating usage up the hierarchy and comparing
> > siblings based on that criteria. I know exactly why he chose that
> > implementation detail early on, and it was to prevent cases such as this
> > and to not let userspace hide from the oom killer.
>
> This doesn't address how it's different from a single process
> following the same pattern right now.
>

Are you referring to a single process being rewritten into N different
subprocesses that do the same work as the single process but are separated
in this manner so that no single process has a large rss and gets oom
killed?

This is solved by a cgroup-aware oom killer because these subprocesses
should not be able to escape their own chargeable entity. It's exactly the
use case that Roman is addressing, correct? My suggestion is to continue
to iterate the usage up the hierarchy so that users can't easily defeat
this by creating N subcontainers instead.
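
A rough sketch of the difference, with made-up names and numbers (this is
an illustration, not code from the patchset): a flat comparison of leaf
groups lets a user shrink each candidate by splitting across N
subcontainers, while comparing siblings by usage summed up the hierarchy
does not. The Cgroup class and both pick_* helpers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cgroup:
    name: str
    own_usage: int = 0  # pages charged directly to this group (made-up units)
    children: List["Cgroup"] = field(default_factory=list)


def subtree_usage(cg: Cgroup) -> int:
    """Usage of a group plus everything charged below it."""
    return cg.own_usage + sum(subtree_usage(c) for c in cg.children)


def leaves(cg: Cgroup) -> List[Cgroup]:
    """All leaf groups under cg (cg itself if it has no children)."""
    if not cg.children:
        return [cg]
    out: List[Cgroup] = []
    for child in cg.children:
        out.extend(leaves(child))
    return out


def pick_flat(root: Cgroup) -> Cgroup:
    """Compare leaf groups only, ignoring what their ancestors add up to."""
    return max(leaves(root), key=lambda c: c.own_usage)


def pick_hierarchical(root: Cgroup) -> Cgroup:
    """At each level, descend into the sibling with the largest subtree usage."""
    cg = root
    while cg.children:
        cg = max(cg.children, key=subtree_usage)
    return cg


# A student spreads 1000 pages over 10 subcontainers; the admin uses 300.
student = Cgroup(
    "student",
    children=[Cgroup(f"student/{i}", own_usage=100) for i in range(10)],
)
admin = Cgroup("admin", own_usage=300)
root = Cgroup("root", children=[student, admin])

print(pick_flat(root).name)          # the admin is the largest single leaf
print(pick_hierarchical(root).name)  # a group under student is selected
```

With the flat comparison the admin is chosen even though the student's
subtree uses more than three times as much memory; with the hierarchical
comparison the selection lands inside the student's subtree no matter how
many subcontainers it is split into.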

> > Let's resolve that global oom is a real condition and getting into that
> > situation is not a userspace problem. It's the result of overcommitting
> > the system, and is used in the enterprise to address business goals. If
> > the above is true, and it's up to memcg to prevent global oom in the first
> > place, then this entire patchset is absolutely pointless. Limit userspace
> > to 95% of memory and when usage is approaching that limit, let userspace
> > attached to the root memcg iterate the hierarchy itself and kill from the
> > largest consumer.
> >
> > This patchset exists because overcommit is real, exactly the same as
> > overcommit within memcg hierarchies is real. 99% of the time we don't run
> > into global oom because people aren't using their limits so it just works
> > out. 1% of the time we run into global oom and we need a decision to be
> > made for forward progress. Using Michal's earlier example of admins and
> > students, a student can easily use all of his limit and also, with v10 of
> > this patchset, 99% of the time avoid being oom killed just by forking N
> > processes over N cgroups. It's going to oom kill an admin every single
> > time.
>
> We overcommit too, but our workloads organize themselves based on
> managing their resources, not based on evading the OOM killer. I'd
> wager that's true for many if not most users.
>

No workloads are based on evading the oom killer; we are specifically
trying to avoid that with oom priorities. Workloads have the power to
increase their own priority to be preferred for kill, but not to decrease
an oom priority that was set by an activity manager. This is exactly
the same as how /proc/pid/oom_score_adj works. With a cgroup-aware oom
killer, which we'd love, nothing can possibly evade the oom killer if
there are oom priorities.
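
To make the rule concrete, here is a minimal sketch of the semantics I
have in mind. The OomPriority class and its "privileged" flag are
hypothetical stand-ins, not an existing kernel or cgroup interface; the
flag plays the role that, as I understand it, CAP_SYS_RESOURCE plays when
lowering /proc/pid/oom_score_adj below its recorded minimum.

```python
class OomPriority:
    """Hypothetical per-group oom priority; higher means killed first."""

    def __init__(self, floor: int) -> None:
        self.floor = floor  # lowest value the workload itself may select
        self.value = floor

    def set(self, new_value: int, privileged: bool = False) -> None:
        if privileged:
            # The activity manager may move the priority anywhere and
            # resets the floor that the workload cannot go below.
            self.floor = new_value
            self.value = new_value
            return
        if new_value < self.floor:
            raise PermissionError("cannot lower oom priority below the floor")
        # Raising the priority (volunteering to be killed) is always allowed.
        self.value = new_value


prio = OomPriority(floor=200)
prio.set(500)                   # workload volunteers to be killed earlier
prio.set(200)                   # and may return to its assigned floor
try:
    prio.set(100)               # but may not hide below it
except PermissionError as err:
    print("rejected:", err)
prio.set(100, privileged=True)  # only the activity manager may lower it
```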