Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control

From: Michal Hocko
Date: Wed Sep 09 2020 - 11:14:57 EST


[Sorry, this slipped through the cracks]

On Mon 24-08-20 12:58:50, Johannes Weiner wrote:
> On Fri, Aug 21, 2020 at 09:37:16PM +0200, Peter Zijlstra wrote:
[...]
> > Arguably seeing the rate drop to near 0 is a very good point to consider
> > running cgroup-OOM.
>
> Agreed. In the past, that's actually what we did: In cgroup1, you
> could disable the kernel OOM killer, and when reclaim failed at the
> limit, the allocating task would be put on a waitqueue until woken up
> by a freeing event. Conceptually this is clean & straightforward.
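
For reference, the cgroup1 interface for this is memory.oom_control
together with an eventfd registered via cgroup.event_control. A minimal
sketch of such a userspace OOM handler (the cgroup path here is made
up, adjust it for the local v1 hierarchy):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical cgroup path, adjust for the local v1 hierarchy */
	const char *base = "/sys/fs/cgroup/memory/workload";
	char buf[256];
	uint64_t n;
	int efd, ofd, cfd;

	efd = eventfd(0, 0);
	snprintf(buf, sizeof(buf), "%s/memory.oom_control", base);
	ofd = open(buf, O_RDWR);
	snprintf(buf, sizeof(buf), "%s/cgroup.event_control", base);
	cfd = open(buf, O_WRONLY);
	if (efd < 0 || ofd < 0 || cfd < 0)
		return 1;

	/* disable the kernel OOM killer: allocators block at the limit
	 * instead of being killed */
	if (write(ofd, "1", 2) < 0)
		return 1;

	/* register the eventfd for OOM notifications */
	snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
	if (write(cfd, buf, strlen(buf) + 1) < 0)
		return 1;

	/* blocks until reclaim fails at the limit */
	if (read(efd, &n, sizeof(n)) == sizeof(n))
		printf("OOM in %s: pick and kill a victim from here\n", base);
	return 0;
}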
>
> However,
>
> 1. Putting allocation contexts with unknown locks to indefinite sleep
> caused deadlocks, for obvious reasons. Userspace OOM killing tends
> to take a lot of task-specific locks when scanning through /proc
> files for kill candidates, and can easily get stuck.
>
> Using bounded over indefinite waits is simply acknowledging that
> the deadlock potential when connecting arbitrary task stacks in the
> system through free->alloc ordering is equally difficult to plan
> out as alloc->free ordering.
>
> The non-cgroup OOM killer actually has the same deadlock potential,
> where the allocating/killing task can hold resources that the OOM
> victim requires to exit. The OOM reaper hides it, the static
> emergency reserves hide it - but to truly solve this problem, you
> would have to have full knowledge of memory & lock ordering
> dependencies of those tasks. And then can still end up with
> scenarios where the only answer is panic().

Yes. Even killing all eligible tasks is not guaranteed to help the
situation, because a) resources might not be bound to a process's
lifetime (e.g. tmpfs), or b) an ineligible task might be holding
resources that block others from doing the proper cleanup. The OOM
reaper is there to make sure we reclaim some of the victim's address
space, and we go over all eligible tasks rather than getting stuck on
the first victim forever.

> 2. I don't recall ever seeing situations in cgroup1 where the precise
> matching of allocation rate to freeing rate has allowed cgroups to
> run sustainably after reclaim has failed. The practical benefit of
> a complicated feedback loop over something crude & robust once
> we're in an OOM situation is not apparent to me.

Yes, the usual behavior is to go OOM and kill something. Running on the
very edge of (memcg) OOM doesn't tend to be sustainable, and I am not
sure it makes sense to optimize for that case.

> [ That's different from the IO-throttling *while still doing
> reclaim* that Dave brought up. *That* justifies the same effort
> we put into dirty throttling. I'm only talking about the
> situation where reclaim has already failed and we need to
> facilitate userspace OOM handling. ]
>
> So that was the motivation for the bounded sleeps. They do not
> guarantee containment, but they provide a reasonable amount of time
> for the userspace OOM handler to intervene, without deadlocking.

Yes, memory.high is mostly a best-effort containment. We do have the
hard limit to put a stop to runaways, and if you are watching PSI, then
the high limit throttling gives you enough of a signal to take action
from userspace.
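
To make the PSI option concrete: since 5.2 userspace can register a
trigger on the cgroup2 memory.pressure file and poll() it, so a handler
wakes up once stalls cross a threshold. A minimal sketch along the
lines of Documentation/accounting/psi.rst (the cgroup path and the
thresholds here are made up):

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical cgroup2 path and thresholds, adjust to taste */
	const char *path = "/sys/fs/cgroup/workload/memory.pressure";
	char trig[] = "some 150000 1000000";	/* 150ms stall per 1s window */
	struct pollfd fds;

	fds.fd = open(path, O_RDWR | O_NONBLOCK);
	if (fds.fd < 0)
		return 1;
	if (write(fds.fd, trig, strlen(trig) + 1) < 0)
		return 1;
	fds.events = POLLPRI;

	for (;;) {
		if (poll(&fds, 1, -1) < 0)
			return 1;
		if (fds.revents & POLLERR)
			return 0;	/* cgroup went away */
		if (fds.revents & POLLPRI)
			/* threshold crossed: a userspace OOM/pressure
			 * handler would kick in here */
			printf("memory pressure event\n");
	}
}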

> That all being said, the semantics of the new 'high' limit in cgroup2
> have allowed us to move reclaim/limit enforcement out of the
> allocation context and into the userspace return path.
>
> See the call to mem_cgroup_handle_over_high() from
> tracehook_notify_resume(), and the comments in try_charge() around
> set_notify_resume().
>
> This already solves the free->alloc ordering problem by allowing the
> allocation to exceed the limit temporarily until at least all locks
> are dropped, we know we can sleep etc., before performing enforcement.
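
For the record, the deferred enforcement described above looks roughly
like this (a simplified sketch of the pattern, not the literal
mm/memcontrol.c code):

/* charge side: roughly what try_charge() does over the high limit,
 * no sleeping here, just account the excess and flag the task */
if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->memory.high)) {
	current->memcg_nr_pages_over_high += nr_pages;
	set_notify_resume(current);	/* raises TIF_NOTIFY_RESUME */
}

/* return-to-userspace side, called from tracehook_notify_resume():
 * arbitrary kernel locks have been dropped, sleeping is safe */
void mem_cgroup_handle_over_high(void)
{
	unsigned int nr_pages = current->memcg_nr_pages_over_high;
	struct mem_cgroup *memcg;

	if (likely(!nr_pages))
		return;

	memcg = get_mem_cgroup_from_mm(current->mm);
	current->memcg_nr_pages_over_high = 0;
	/* reclaim and/or throttle here, outside of any locks */
	reclaim_high(memcg, nr_pages, GFP_KERNEL);
	css_put(&memcg->css);
}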
>
> That means we may not need the timed sleeps anymore for that purpose,
> and could bring back directed waits for freeing-events again.
>
> What do you think? Any hazards around indefinite sleeps in that resume
> path? It's called before __rseq_handle_notify_resume and the
> arch-specific resume callback (which appears to be a no-op currently).
>
> Chris, Michal, what are your thoughts? It would certainly be simpler
> conceptually on the memcg side.

I would need a more specific description. But as I've already said, it
doesn't seem that we need to fix any practical problem here. The high
limit implementation has changed quite a lot recently. I would rather
see it settle for a while, and see how it behaves in a wider variety of
workloads, before changing the implementation again.

--
Michal Hocko
SUSE Labs