Re: [PATCH] mm: memcontrol: asynchronous reclaim for memory.high

From: Johannes Weiner
Date: Wed Feb 19 2020 - 14:16:24 EST

On Wed, Feb 19, 2020 at 07:37:31PM +0100, Michal Hocko wrote:
> On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
> > We have received regression reports from users whose workloads moved
> > into containers and subsequently encountered new latencies. For some
> > users these were a nuisance, but for some it meant missing their SLA
> > response times. We tracked those delays down to cgroup limits, which
> > inject direct reclaim stalls into the workload where previously all
> > reclaim was handled my kswapd.
> I am curious why is this unexpected when the high limit is explicitly
> documented as a throttling mechanism.

Memory.high is supposed to curb aggressive growth using throttling
instead of OOM killing. However, if the workload has plenty of easily
reclaimable memory and just needs to recycle a couple of cache pages
to permit an allocation, there is no need to throttle the workload -
just as there wouldn't be any need to trigger the OOM killer.

So it's not unexpected, but it's unnecessarily heavy-handed: since
memory.high allows some flexibility around the target size, we can
move the routine reclaim activity (cache recycling) out of the main
execution stream of the workload, just like we do with kswapd. If that
cannot keep up, we can throttle and do direct reclaim.

It doesn't change the memory.high semantics, but it allows exploiting
the fact that we have SMP systems and can parallize the book keeping.

> > This patch adds asynchronous reclaim to the memory.high cgroup limit
> > while keeping direct reclaim as a fallback. In our testing, this
> > eliminated all direct reclaim from the affected workload.
> Who is accounted for all the work? Unless I am missing something this
> just gets hidden in the system activity and that might hurt the
> isolation. I do see how moving the work to a different context is
> desirable but this work has to be accounted properly when it is going to
> become a normal mode of operation (rather than a rare exception like the
> existing irq context handling).

Yes, the plan is to account it to the cgroup on whose behalf we're
doing the work.

The problem is that we have a general lack of usable CPU control right
now - see Rik's work on this:
For workloads that are contended on CPU, we cannot enable the CPU
controller because the scheduling latencies are too high. And for
workloads that aren't CPU contended, well, it doesn't really matter
where the reclaim cycles are accounted to.

Once we have the CPU controller up to speed, we can add annotations
like these to account stretches of execution to specific
cgroups. There just isn't much point to do it before we can actually
enable CPU control on the real workloads where it would matter.

[ This is generally work in process: for example, if you isolate
workloads with memory.low, kswapd cpu time isn't accounted to the
cgroup that causes it. Swap IO issued by kswapd isn't accounted to
the group that is getting swapped. The memory consumed by struct
cgroup itself, the percpu allocations for the vmstat arrays etc.,
which is sizable, are not accounted to the cgroup that creates
subgroups, and so forth. ]