Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control

From: Johannes Weiner
Date: Mon Aug 24 2020 - 13:01:09 EST


On Fri, Aug 21, 2020 at 09:37:16PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 18, 2020 at 09:49:00AM -0400, Johannes Weiner wrote:
> > On Tue, Aug 18, 2020 at 12:18:44PM +0200, peterz@xxxxxxxxxxxxx wrote:
> > > What you need is a feedback loop against the rate of freeing pages, and
> > > when you near the saturation point, the allocation rate should exactly
> > > match the freeing rate.
> >
> > IO throttling solves a slightly different problem.
> >
> > IO occurs in parallel to the workload's execution stream, and you're
> > trying to take the workload from dirtying at CPU speed to rate match
> > to the independent IO stream.
> >
> > With memory allocations, though, freeing happens from inside the
> > execution stream of the workload. If you throttle allocations, you're
>
> For a single task, but even then you're making the argument that we need
> to allocate memory to free memory, and we all know where that gets us.
>
> But we're actually talking about a cgroup here, which is a collection of
> tasks all doing things in parallel.

Right, but sharing a memory cgroup means sharing an LRU list, and that
transfers memory pressure and allocation burden between otherwise
independent tasks - if nothing else through cache misses on the
executables and libraries. I doubt that one task can go through
several comprehensive reclaim cycles on a shared LRU without
completely annihilating the latency or throughput targets of everybody
else in the group in most real world applications.

> > most likely throttling the freeing rate as well. And you'll slow down
> > reclaim scanning by the same amount as the page references, so it's
> > not making reclaim more successful either. The alloc/use/free
> > (im)balance is an inherent property of the workload, regardless of the
> > speed you're executing it at.
>
> Arguably seeing the rate drop to near 0 is a very good point to consider
> running cgroup-OOM.

Agreed. In the past, that's actually what we did: in cgroup1, you
could disable the kernel OOM killer, and when reclaim failed at the
limit, the allocating task would be put on a waitqueue until woken up
by a freeing event. Conceptually this is clean and straightforward.

However,

1. Putting allocation contexts with unknown locks to indefinite sleep
   caused deadlocks, for obvious reasons. Userspace OOM killing tends
   to take a lot of task-specific locks when scanning through /proc
   files for kill candidates, and can easily get stuck.

   Using bounded over indefinite waits is simply acknowledging that
   the deadlock potential when connecting arbitrary task stacks in the
   system through free->alloc ordering is equally difficult to plan
   out as alloc->free ordering.

   The non-cgroup OOM killer actually has the same deadlock potential,
   where the allocating/killing task can hold resources that the OOM
   victim requires to exit. The OOM reaper hides it, the static
   emergency reserves hide it - but to truly solve this problem, you
   would have to have full knowledge of memory & lock ordering
   dependencies of those tasks. And then you can still end up with
   scenarios where the only answer is panic().

2. I don't recall ever seeing situations in cgroup1 where the precise
   matching of allocation rate to freeing rate has allowed cgroups to
   run sustainably after reclaim has failed. The practical benefit of
   a complicated feedback loop over something crude & robust once
   we're in an OOM situation is not apparent to me.

   [ That's different from the IO-throttling *while still doing
     reclaim* that Dave brought up. *That* justifies the same effort
     we put into dirty throttling. I'm only talking about the
     situation where reclaim has already failed and we need to
     facilitate userspace OOM handling. ]

So that was the motivation for the bounded sleeps. They do not
guarantee containment, but they provide a reasonable amount of time
for the userspace OOM handler to intervene, without deadlocking.
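
To make the trade-off concrete, here is a toy userspace model of the
two wait strategies. It is not kernel code: the struct, the tick-based
loop and every name in it are made up for this sketch. The only thing
it assumes from the discussion above is that the sleeping allocator may
hold a lock that the freeing side needs.

```c
#include <assert.h>
#include <stdbool.h>

enum wait_result { WAIT_FREED, WAIT_TIMED_OUT, WAIT_STUCK };

/* Hypothetical stand-in for a cgroup at its limit. */
struct toy_group {
	long usage;
	long limit;
	bool waiter_holds_lock;	/* the sleeping allocator holds a lock */
};

/* The freeing side can only make progress if it can take that lock. */
bool toy_freer_step(struct toy_group *g)
{
	if (g->waiter_holds_lock)
		return false;
	if (g->usage > 0)
		g->usage--;
	return true;
}

/*
 * Wait for usage to drop to the limit.
 * max_ticks < 0 models an indefinite wait, max_ticks >= 0 a bounded one.
 * horizon caps the simulation so that a deadlock shows up as WAIT_STUCK.
 */
enum wait_result toy_wait_for_free(struct toy_group *g, long max_ticks,
				   long horizon)
{
	assert(horizon > 0);

	for (long t = 0; t < horizon; t++) {
		if (g->usage <= g->limit)
			return WAIT_FREED;
		if (max_ticks >= 0 && t >= max_ticks)
			return WAIT_TIMED_OUT;	/* give up, caller proceeds */
		toy_freer_step(g);
	}
	return WAIT_STUCK;
}
```

With waiter_holds_lock set, the indefinite wait never returns within
the horizon, while the bounded wait times out and lets the caller carry
on over the limit. That is the "no guaranteed containment, but no
deadlock" property described above.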


That all being said, the semantics of the new 'high' limit in cgroup2
have allowed us to move reclaim/limit enforcement out of the
allocation context and into the userspace return path.

See the call to mem_cgroup_handle_over_high() from
tracehook_notify_resume(), and the comments in try_charge() around
set_notify_resume().

This already solves the free->alloc ordering problem by allowing the
allocation to exceed the limit temporarily, at least until all locks
are dropped and we know we can sleep, before performing enforcement.
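
As a rough illustration of that split, here is a toy userspace model.
It is only a sketch: the model_* names and both structs are invented
here and do not match the actual try_charge() or
mem_cgroup_handle_over_high() code, and reclaim is assumed to always
succeed. The point is just the ordering: the charge path only marks the
task, and enforcement runs later at a point where no locks are held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins, not the kernel's structures. */
struct model_task {
	bool notify_resume;	/* set when enforcement is owed */
	int locks_held;
};

struct model_memcg {
	long usage;
	long high;
};

/* Charge path: never sleeps, may push usage above 'high'. */
void model_charge(struct model_memcg *cg, struct model_task *t, long pages)
{
	cg->usage += pages;
	if (cg->usage > cg->high)
		t->notify_resume = true;	/* defer enforcement */
}

/*
 * Resume path: runs on the way back to userspace, with no locks held.
 * Returns the number of pages "reclaimed" in this model.
 */
long model_handle_over_high(struct model_memcg *cg, struct model_task *t)
{
	long excess;

	assert(t->locks_held == 0);

	if (!t->notify_resume)
		return 0;
	t->notify_resume = false;

	excess = cg->usage - cg->high;
	if (excess <= 0)
		return 0;

	cg->usage -= excess;	/* assumption: reclaim always succeeds */
	return excess;
}
```

A charge of 20 pages against usage 90 and high 100 leaves usage at 110
with the flag set. The later resume-path call brings it back to 100
once the task has dropped its locks.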

That means we may not need the timed sleeps anymore for that purpose,
and could bring back directed waits for freeing-events again.

What do you think? Any hazards around indefinite sleeps in that resume
path? It's called before __rseq_handle_notify_resume and the
arch-specific resume callback (which appears to be a no-op currently).

Chris, Michal, what are your thoughts? It would certainly be simpler
conceptually on the memcg side.