Re: [rfc] superblock shrinker accumulating excessive deferred counts
From: Dave Chinner
Date: Tue Jul 18 2017 - 21:33:57 EST
On Tue, Jul 18, 2017 at 05:28:14PM -0700, David Rientjes wrote:
> On Tue, 18 Jul 2017, Dave Chinner wrote:
>
> > > Thanks for looking into this, Dave!
> > >
> > > The number of GFP_NOFS allocations that build up the deferred counts can
> > > be unbounded, however, so this can become excessive, and the oom killer
> > > will not kill any processes in this context. Although the motivation to
> > > do additional reclaim because of past GFP_NOFS reclaim attempts is
> > > worthwhile, I think it should be limited because currently it only
> > > increases until something is able to start draining these excess counts.
> >
> > Usually kswapd is kicked in by this point and starts doing work. Why
> > isn't kswapd doing the shrinker work in the background?
> >
>
> It is, and often gets preempted itself while in lru scanning or
> shrink_slab(), most often super_cache_count() itself. The issue is that
> it gets preempted by networking packets being sent in irq context which
> ends up eating up GFP_ATOMIC memory.
That seems like a separate architectural problem - memory allocation
threads preempting the memory reclaim threads they depend on for
progress is a more general priority inversion problem, not a
shrinker problem. It's almost impossible to work around this
sort of "supply can't keep up with demand because demand has higher
priority and starves supply" problem by hacking around in the supply
context...
> One of the key traits of this is
> that per-zone free memory is far below the min watermarks so not only is
> there insufficient memory for GFP_NOFS, but also insufficient memory for
> GFP_ATOMIC. Kswapd will only slab shrink a proportion of the lru scanned
> if it is not lucky enough to grab the excess nr_deferred. And meanwhile
> other threads end up increasing it.
It sounds very much like GFP_KERNEL kswapd reclaim context needs to
run with higher priority than the network driver ISR threads. Or, if
the drivers actually do large amounts of memory allocation in IRQ
context, then that work needs to be moved into threaded ISRs that
can be scheduled appropriately to prevent starvation of memory reclaim.
i.e. the network drivers should be dropping packets because they
can't get memory, not pre-empting reclaim infrastructure in an
attempt to get more memory allocated because packets are incoming...
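To illustrate what "schedulable" buys you, here's a minimal sketch of a
threaded interrupt handler - hypothetical device and made-up names, not
a fix for any particular driver. The hard handler just acks the
hardware and defers everything else; the thread function runs in
process context, so it can sleep, doesn't need GFP_ATOMIC, and its
priority relative to kswapd becomes a plain scheduler decision:

#include <linux/interrupt.h>

static irqreturn_t mydev_hardirq(int irq, void *dev_id)
{
	/* ack/mask the device; no memory allocation here */
	return IRQ_WAKE_THREAD;
}

static irqreturn_t mydev_thread_fn(int irq, void *dev_id)
{
	/* allocation-heavy packet processing moves here, in a
	 * schedulable context that doesn't starve reclaim */
	return IRQ_HANDLED;
}

static int mydev_setup_irq(unsigned int irq, void *dev)
{
	return request_threaded_irq(irq, mydev_hardirq, mydev_thread_fn,
				    IRQF_ONESHOT, "mydev", dev);
}

At that point "packet processing vs kswapd" is a scheduling policy
decision rather than interrupts always winning.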
> It's various workloads and I can't show a specific example of GFP_NOFS
> allocations in flight because we have made changes to prevent this,
> specifically ignoring nr_deferred counts for SHRINKER_MEMCG_AWARE
> shrinkers since they are largely erroneous. This can also occur if we
> cannot grab the trylock on the superblock itself.
Which should be pretty rare.
>
> > Ugh. The per-node lru list count was designed to run unlocked and so
> > avoid this sort of (known) scalability problem.
> >
> > Ah, see the difference between list_lru_count_node() and
> > list_lru_count_one(). list_lru_count_one() should only take locks
> > for memcg lookups if it is trying to shrink a memcg. That needs to
> > be fixed before anything else and, if possible, the memcg lookup be
> > made lockless....
> >
>
> We've done that as part of this fix, actually, by avoiding doing resizing
> of these list_lru's when the number of memcg cache ids increase. We just
> preallocate the max amount, MEMCG_CACHES_MAX_SIZE, to do lockless reads
> since the lock there is only needed to prevent concurrent remapping.
And if you've fixed this, why is the system getting stuck counting
the number of objects on the LRU? Or does that just move the
serialisation to the scan call itself?
If so, I suspect this is going to be another case of direct reclaim
trying to drive unbounded parallelism through shrinkers that have no
parallelism at all, because the caches being shrunk only have a
single list in memcg contexts. There's nothing quite like a
thundering herd of allocations all trying to run direct reclaim at
the same time and getting backed up in the same shrinker context
because the shrinker effectively serialises access to the cache....
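FWIW, to make the lockless-count idea concrete, here's roughly what I
understand the preallocation fix to look like - a sketch with made-up
type names, not the actual list_lru code. If the per-memcg array is
preallocated at MEMCG_CACHES_MAX_SIZE it is never remapped, so the lock
that guarded the lookup isn't needed on the count path:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/compiler.h>

struct memcg_lru_list {
	struct list_head	list;
	long			nr_items;
};

struct node_lru {
	spinlock_t		lock;		/* still taken to add/remove objects */
	struct memcg_lru_list	*memcg_lists;	/* MEMCG_CACHES_MAX_SIZE entries, fixed */
};

static unsigned long node_lru_count_one(struct node_lru *nlru, int memcg_idx)
{
	/*
	 * The array is never resized, so the lookup and the counter read
	 * are safe without nlru->lock.  The result is approximate, which
	 * is all the shrinker's count path needs.
	 */
	return READ_ONCE(nlru->memcg_lists[memcg_idx].nr_items);
}

Of course that only addresses the count side; whether the scan side
just becomes the new serialisation point is exactly the question above.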
> > Yup, the memcg shrinking was shoe-horned into the per-node LRU
> > infrastructure, and the high level accounting is completely unaware
> > of the fact that memcgs have their own private LRUs. We left the
> > windup in place because slab caches are shared, and it's possible
> > that memory can't be freed because pages have objects from different
> > memcgs pinning them. Hence we need to bleed at least some of that
> > "we can't make progress" count back into the global "deferred
> > reclaim" pool to get other contexts to do some reclaim.
> >
>
> Right, now we've patched our kernel to avoid looking at the nr_deferred
> count for SHRINKER_MEMCG_AWARE but that's obviously a short-term solution,
> and I'm not sure that we can spare the tax to get per-memcg per-node
> deferred counts.
I very much doubt it - it was too expensive to even consider a few
years ago and the cost hasn't gone down at all...
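For anyone else following along, the windup being argued over is
roughly this - a greatly simplified sketch of the shrink_slab()
accounting, not the literal mm/vmscan.c code:

#include <linux/shrinker.h>
#include <linux/atomic.h>
#include <linux/kernel.h>

static unsigned long sketch_do_shrink_slab(struct shrink_control *sc,
					   struct shrinker *shrinker)
{
	long batch = shrinker->batch ? shrinker->batch : 128;
	long freeable = shrinker->count_objects(shrinker, sc);
	unsigned long freed = 0;
	long total_scan;

	/* whoever gets here first grabs *all* the previously deferred work */
	total_scan = atomic_long_xchg(&shrinker->nr_deferred[sc->nid], 0);
	total_scan += freeable >> 1;	/* plus this caller's own share, simplified */

	while (total_scan > 0) {
		unsigned long ret;

		sc->nr_to_scan = min_t(long, total_scan, batch);
		ret = shrinker->scan_objects(shrinker, sc);
		if (ret == SHRINK_STOP)
			break;	/* e.g. super_cache_scan() seeing !__GFP_FS */
		freed += ret;
		total_scan -= sc->nr_to_scan;
	}

	/*
	 * Whatever this caller couldn't (or wasn't allowed to) scan is
	 * parked for the next caller.  A stream of GFP_NOFS allocations
	 * hitting this with nothing able to drain it is the unbounded
	 * growth being reported.
	 */
	if (total_scan > 0)
		atomic_long_add(total_scan, &shrinker->nr_deferred[sc->nid]);

	return freed;
}

The atomic_long_xchg() is also why only whichever caller gets there
first (kswapd, if it's lucky) sees the accumulated count, which is the
behaviour David describes above.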
But, really, what I'm hearing at the moment is that the shrinker
issues are only a symptom of a deeper architectural problem and not
the cause. It sounds to me like it's simply a case of severe
demand-driven breakdown because the GFP_KERNEL memory reclaim
mechanisms are being starved of CPU time by allocation contexts that
can't do direct reclaim. That problem needs to be solved first, then
we can look at what happens when GFP_KERNEL reclaim contexts are
given the CPU time they need to keep up with interrupt context
GFP_ATOMIC allocation demand....
Cheers,
Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx