Re: [RFC PATCH 0/3] Make deferred split shrinker memcg aware

From: Yang Shi
Date: Tue May 28 2019 - 22:37:57 EST




On 5/29/19 9:22 AM, David Rientjes wrote:
On Tue, 28 May 2019, Yang Shi wrote:

I got some reports from our internal application team about memcg OOM.
Even though the application has been killed by oom killer, there are
still a lot THPs reside, page reclaim doesn't reclaim them at all.

Some investigation shows they are on deferred split queue, memcg direct
reclaim can't shrink them since THP deferred split shrinker is not memcg
aware, this may cause premature OOM in memcg. The issue can be
reproduced easily by the below test:

Right, we've also encountered this. I talked to Kirill about it a week or
so ago where the suggestion was to split all compound pages on the
deferred split queues under the presence of even memory pressure.

That breaks cgroup isolation and perhaps unfairly penalizes workloads that
are running attached to other memcg hierarchies that are not under
pressure because their compound pages are now split as a side effect.
There is a benefit to keeping these compound pages around while not under
memory pressure if all pages are subsequently mapped again.

Yes, I do agree. I tried other approaches too, it sounds making deferred split queue per memcg is the optimal one.


$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
$ cgexec -g memory:thp ./transhuge-stress 4000

transhuge-stress comes from kernel selftest.

It is easy to hit OOM, but there are still a lot THP on the deferred split
queue, memcg direct reclaim can't touch them since the deferred split
shrinker is not memcg aware.

Yes, we have seen this on at least 4.15 as well.

Convert deferred split shrinker memcg aware by introducing per memcg deferred
split queue. The THP should be on either per node or per memcg deferred
split queue if it belongs to a memcg. When the page is immigrated to the
other memcg, it will be immigrated to the target memcg's deferred split queue
too.

And, move deleting THP from deferred split queue in page free before memcg
uncharge so that the page's memcg information is available.

Reuse the second tail page's deferred_list for per memcg list since the same
THP can't be on multiple deferred split queues at the same time.

Remove THP specific destructor since it is not used anymore with memcg aware
THP shrinker (Please see the commit log of patch 2/3 for the details).

Make deferred split shrinker not depend on memcg kmem since it is not slab.
It doesn't make sense to not shrink THP even though memcg kmem is disabled.

With the above change the test demonstrated above doesn't trigger OOM anymore
even though with cgroup.memory=nokmem.

I'm curious if your internal applications team is also asking for
statistics on how much memory can be freed if the deferred split queues
can be shrunk? We have applications that monitor their own memory usage

No, but this reminds me. The THPs on deferred split queue should be accounted into available memory too.

through memcg stats or usage and proactively try to reduce that usage when
it is growing too large. The deferred split queues have significantly
increased both memcg usage and rss when they've upgraded kernels.

How are your applications monitoring how much memory from deferred split
queues can be freed on memory pressure? Any thoughts on providing it as a
memcg stat?

I don't think they have such monitor. I saw rss_huge is abormal in memcg stat even after the application is killed by oom, so I realized the deferred split queue may play a role here.

The memcg stat doesn't have counters for available memory as global vmstat. It may be better to have such statistics, or extending reclaimable "slab" to shrinkable/reclaimable "memory".


Thanks!