Re: [rfc] superblock shrinker accumulating excessive deferred counts

From: David Rientjes
Date: Tue Jul 18 2017 - 20:28:20 EST


On Tue, 18 Jul 2017, Dave Chinner wrote:

> > Thanks for looking into this, Dave!
> >
> > The number of GFP_NOFS allocations that build up the deferred counts can
> > be unbounded, however, so this can become excessive, and the oom killer
> > will not kill any processes in this context. Although the motivation to
> > do additional reclaim because of past GFP_NOFS reclaim attempts is
> > worthwhile, I think it should be limited because currently it only
> > increases until something is able to start draining these excess counts.
>
> Usually kswapd is kicked in by this point and starts doing work. Why
> isn't kswapd doing the shrinker work in the background?
>

It is, and it often gets preempted itself while doing lru scanning or
shrink_slab(), most often in super_cache_count() itself. The issue is
that it gets preempted by networking packets being sent in irq context,
which ends up eating up GFP_ATOMIC memory. One of the key traits of this
state is that per-zone free memory is far below the min watermarks, so
there is insufficient memory not only for GFP_NOFS but also for
GFP_ATOMIC. Kswapd will only shrink slab in proportion to the lru pages
scanned unless it is lucky enough to grab the excess nr_deferred, and
meanwhile other threads keep increasing it.
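
For reference, here is a minimal sketch of the windup arithmetic as I
read it (simplified from do_shrink_slab() in mm/vmscan.c; the function
name is made up and the real code has more clamping):

#include <linux/shrinker.h>
#include <linux/atomic.h>

/* Simplified sketch, not the exact mm/vmscan.c code. */
static unsigned long sketch_do_shrink_slab(struct shrink_control *sc,
					   struct shrinker *shrinker,
					   unsigned long nr_scanned,
					   unsigned long nr_eligible)
{
	long freeable = shrinker->count_objects(shrinker, sc);
	long batch = shrinker->batch ? : 128;	/* SHRINK_BATCH */
	long nr, delta, total_scan;
	unsigned long freed = 0;

	if (freeable == 0)
		return 0;

	/* One caller takes the entire backlog; everyone else sees zero. */
	nr = atomic_long_xchg(&shrinker->nr_deferred[sc->nid], 0);

	delta = (4 * nr_scanned * freeable) /
		(shrinker->seeks * (nr_eligible + 1));
	total_scan = nr + delta;

	while (total_scan >= batch) {
		unsigned long ret;

		sc->nr_to_scan = batch;
		ret = shrinker->scan_objects(shrinker, sc);
		if (ret == SHRINK_STOP)	/* e.g. GFP_NOFS in super_cache_scan() */
			break;
		freed += ret;
		total_scan -= batch;
	}

	/* Whatever could not be scanned is deferred to the next caller. */
	if (total_scan > 0)
		atomic_long_add(total_scan, &shrinker->nr_deferred[sc->nid]);

	return freed;
}

So kswapd has to win that xchg to drain the backlog; when a GFP_NOFS
caller wins it instead, everything is simply added straight back.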

These are various workloads, and I can't show a specific example of
GFP_NOFS allocations in flight because we have made changes to prevent
this: specifically, ignoring the nr_deferred counts for
SHRINKER_MEMCG_AWARE shrinkers, since they are largely erroneous. The
same windup can also occur if we cannot grab the trylock on the
superblock itself.
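
Both sources of windup (GFP_NOFS callers and a contended superblock)
come down to the early exits in super_cache_scan(); roughly (a
simplified sketch of fs/super.c with a made-up function name, not the
verbatim code):

#include <linux/fs.h>
#include <linux/shrinker.h>

static unsigned long sketch_super_cache_scan(struct shrinker *shrink,
					     struct shrink_control *sc)
{
	struct super_block *sb = container_of(shrink, struct super_block,
					      s_shrink);

	/*
	 * GFP_NOFS callers cannot recurse into the filesystem, so all of
	 * their would-be work is pushed into nr_deferred instead.
	 */
	if (!(sc->gfp_mask & __GFP_FS))
		return SHRINK_STOP;

	/* The same deferral happens when the superblock is contended. */
	if (!trylock_super(sb))
		return SHRINK_STOP;

	/* ... prune_dcache_sb()/prune_icache_sb() scanning elided ... */

	up_read(&sb->s_umount);
	return 0;
}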

> Ugh. The per-node lru list count was designed to run unlocked and so
> avoid this sort of (known) scalability problem.
>
> Ah, see the difference between list_lru_count_node() and
> list_lru_count_one(). list_lru_count_one() should only take locks
> for memcg lookups if it is trying to shrink a memcg. That needs to
> be fixed before anything else and, if possible, the memcg lookup be
> made lockless....
>

We've done that as part of this fix, actually, by avoiding resizing of
these list_lrus when the number of memcg cache ids increases. We simply
preallocate the maximum, MEMCG_CACHES_MAX_SIZE, so that reads can be
lockless, since the lock there is only needed to prevent concurrent
remapping.
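
The shape of the change is roughly the following (a sketch of our local
patch, not something upstream; the function name is made up and the
helpers are from my reading of mm/list_lru.c): once the per-node
memcg_lrus array is sized to MEMCG_CACHES_MAX_SIZE up front and never
remapped, the count path no longer needs nlru->lock at all.

#include <linux/list_lru.h>
#include <linux/memcontrol.h>

static unsigned long sketch_list_lru_count_one(struct list_lru *lru,
					       int nid,
					       struct mem_cgroup *memcg)
{
	struct list_lru_node *nlru = &lru->node[nid];
	struct list_lru_one *l;

	/*
	 * No spin_lock(&nlru->lock) here: memcg_lrus is preallocated and
	 * never reallocated, so the index lookup cannot race with a resize.
	 */
	l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
	return l ? l->nr_items : 0;
}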

> Yup, the memcg shrinking was shoe-horned into the per-node LRU
> infrastructure, and the high level accounting is completely unaware
> of the fact that memcgs have their own private LRUs. We left the
> windup in place because slab caches are shared, and it's possible
> that memory can't be freed because pages have objects from different
> memcgs pinning them. Hence we need to bleed at least some of that
> "we can't make progress" count back into the global "deferred
> reclaim" pool to get other contexts to do some reclaim.
>

Right, for now we've patched our kernel to avoid looking at the
nr_deferred count for SHRINKER_MEMCG_AWARE shrinkers, but that's
obviously a short-term solution, and I'm not sure we can spare the tax
of maintaining per-memcg per-node deferred counts. It seems that some
other metadata would be needed here to indicate excessive windup from
slab shrinking that cannot actually do any scanning in
super_cache_scan().
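
Concretely, the stopgap amounts to something like this in the
shrink_slab() path (a sketch of the shape of our change with a made-up
helper, not the actual patch):

#include <linux/shrinker.h>
#include <linux/atomic.h>

/*
 * Stopgap: never pick up (or wind up against) nr_deferred for
 * memcg-aware shrinkers, since the count is not kept per-memcg and is
 * largely erroneous for them.
 */
static long sketch_grab_deferred(struct shrinker *shrinker, int nid)
{
	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
		return 0;

	return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
}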

Vladimir, do you have a suggestion, or is there someone else working on
this?