Re: [rfc] superblock shrinker accumulating excessive deferred counts
From: David Rientjes
Date: Mon Jul 17 2017 - 16:37:45 EST
On Mon, 17 Jul 2017, Dave Chinner wrote:
> > This is a side effect of super_cache_count() returning the appropriate
> > count but super_cache_scan() refusing to do anything about it and
> > immediately terminating with SHRINK_STOP, mostly for GFP_NOFS allocations.
>
> Yup. Happens during things like memory allocations in filesystem
> transaction context. e.g. when your memory pressure is generated by
> GFP_NOFS allocations within transactions whilst doing directory
> traversals (say 'chown -R' across an entire filesystem), then we
> can't do direct reclaim on the caches that are generating the memory
> pressure and so have to defer all the work to either kswapd or the
> next GFP_KERNEL allocation context that triggers reclaim.
>
Thanks for looking into this, Dave!
The number of GFP_NOFS allocations that build up the deferred counts can
be unbounded, however, so the deferred total can become excessive, and the
oom killer will not kill any processes in this context. Although doing
additional reclaim to make up for past failed GFP_NOFS reclaim attempts is
worthwhile, I think the deferral should be bounded, because currently it
only grows until something is finally able to start draining the excess.
Having 10,000 GFP_NOFS reclaim attempts each store up
(2 * nr_scanned * freeable) / (nr_eligible + 1) objects, so that the total
exceeds freeable by several orders of magnitude, doesn't seem like a
particularly useful thing. For reference, we have seen nr_deferred for a
single node exceed 10,000,000,000 in practice. total_scan is limited to
2 * freeable for each call to do_shrink_slab(), but such an excessive
deferred count guarantees it retries 2 * freeable objects every time
instead of a count proportional to the fraction of the lru scanned, as
intended.
What breaks if we limit the nr_deferred counts to freeable * 4, for
example?
> > and no matter how much __GFP_FS scanning is done
> > capped by total_scan, we can never fully get down to batch_count == 1024.
>
> I don't see a batch_count variable in the shrinker code anywhere,
> so I'm not sure what you mean by this.
>
batch_size == 1024, sorry.
> Can you post a shrinker trace that shows the deferred count wind
> up and then display the problem you're trying to describe?
>
All threads end up contending on the list_lru's nlru->lock because they
are all stuck in super_cache_count() while one thread iterates through an
excessive number of deferred objects in super_cache_scan(), taking the
same locks, and nr_deferred never substantially goes down.
The problem with the superblock shrinker, which is why I emailed Al
originally, is also that it is SHRINKER_MEMCG_AWARE. Our
list_lru_shrink_count() is only representative of the list_lru for
sc->memcg, and that count is used in both super_cache_count() and
super_cache_scan(). The nr_deferred counts from the do_shrink_slab()
logic, however, are only per-nid; as a result, individual memcgs get
penalized with excessive deferred counts for freeable objects they never
had to begin with.