[rfc] superblock shrinker accumulating excessive deferred counts

From: David Rientjes
Date: Wed Jul 12 2017 - 16:42:44 EST

Hi Al and everyone,

We're encountering an issue where the per-shrinker per-node deferred
counts grow excessively large for the superblock shrinker. This appears
to be long-standing behavior, so reaching out to you to see if there's any
subtleties being overlooked since there is a reference to memory pressure
and GFP_NOFS allocations growing total_scan purposefully.

This is a side effect of super_cache_count() returning the appropriate
count but super_cache_scan() refusing to do anything about it and
immediately terminating with SHRINK_STOP, mostly for GFP_NOFS allocations.

An unlucky thread will grab the per-node shrinker->nr_deferred[nid] count
and increase it by

(2 * nr_scanned * super_cache_count()) / (nr_eligible + 1)

While total_scan is capped to a sane limit, and restricts the amount of
scanning that this thread actually does, if super_cache_scan() immediately
responds with SHRINK_STOP because of GFP_NOFS, the end result of doing any
of this is that nr_deferred just increased. If we have a burst of
GFP_NOFS allocations, this grows it potentially very largely, which we
have seen in practice, and no matter how much __GFP_FS scanning is done
capped by total_scan, we can never fully get down to batch_count == 1024.

This seems troublesome to me and my first inclination was to avoid
counting *any* objects at all for GFP_NOFS but then I notice the comment
in do_shrink_slab():

* We need to avoid excessive windup on filesystem shrinkers
* due to large numbers of GFP_NOFS allocations causing the
* shrinkers to return -1 all the time. This results in a large
* nr being built up so when a shrink that can do some work
* comes along it empties the entire cache due to nr >>>
* freeable. This is bad for sustaining a working set in
* memory.
* Hence only allow the shrinker to scan the entire cache when
* a large delta change is calculated directly.

I assume the comment is referring to "excessive windup" only in terms of
total_scan, although it doesn't impact next_deferred at all. The problem
here seems to be next_deferred always grows extremely large.

I'd like to do this, but am checking for anything subtle that this relies
on wrt memory pressure or implict intended behavior.

Thanks for looking at this!
fs/super.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/super.c b/fs/super.c
--- a/fs/super.c
+++ b/fs/super.c
@@ -65,13 +65,6 @@ static unsigned long super_cache_scan(struct shrinker *shrink,

sb = container_of(shrink, struct super_block, s_shrink);

- /*
- * Deadlock avoidance. We may hold various FS locks, and we don't want
- * to recurse into the FS that called us in clear_inode() and friends..
- */
- if (!(sc->gfp_mask & __GFP_FS))
- return SHRINK_STOP;
if (!trylock_super(sb))

@@ -116,6 +109,13 @@ static unsigned long super_cache_count(struct shrinker *shrink,
struct super_block *sb;
long total_objects = 0;

+ /*
+ * Deadlock avoidance. We may hold various FS locks, and we don't want
+ * to recurse into the FS that called us in clear_inode() and friends..
+ */
+ if (!(sc->gfp_mask & __GFP_FS))
+ return 0;
sb = container_of(shrink, struct super_block, s_shrink);