Re: [RFC PATCH 0/2] Remove shrinker's nr_deferred

From: Dave Chinner
Date: Wed Sep 16 2020 - 22:37:52 EST


On Wed, Sep 16, 2020 at 11:58:21AM -0700, Yang Shi wrote:
>
> Recently huge amount one-off slab drop was seen on some vfs metadata heavy workloads,
> it turned out there were huge amount accumulated nr_deferred objects seen by the
> shrinker.
>
> I managed to reproduce this problem with kernel build workload plus negative dentry
> generator.
>
> First step, run the below kernel build test script:
>
> NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
>
> cd /root/Buildarea/linux-stable
>
> for i in `seq 1500`; do
> cgcreate -g memory:kern_build
> echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes
>
> echo 3 > /proc/sys/vm/drop_caches
> cgexec -g memory:kern_build make clean > /dev/null 2>&1
> cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1
>
> cgdelete -g memory:kern_build
> done
>
> That would generate huge amount deferred objects due to __GFP_NOFS allocations.
>
> Then run the below negative dentry generator script:
>
> NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
>
> mkdir /sys/fs/cgroup/memory/test
> echo $$ > /sys/fs/cgroup/memory/test/tasks
>
> for i in `seq $NR_CPUS`; do
> while true; do
> FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
> cat $FILE 2>/dev/null
> done &
> done
>
> Then kswapd will shrink half of dentry cache in just one loop as the below tracing result
> showed:
>
> kswapd0-475 [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
> objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
> kswapd0-475 [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
> scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928


You have 93M dentries and inodes in the cache, and the reclaim delta is 45746,
which is totally sane for a priority 12 reclaim priority. SO you've
basically had to do a couple of million GFP_NOFS direct reclaim
passes that were unable to reclaim anything to get to a point
where the deferred reclaim would up to 4.9 -billion- objects.

Basically, you would up the deferred work so far that it got out of
control before a GFP_KERNEL reclaim context could do anything to
bring it under control.

However, removing defered work is not the solution. If we don't
defer some of this reclaim work, then filesystem intensive workloads
-cannot reclaim memory from their own caches- when they need memory.
And when those caches largely dominate the used memory in the
machine, this will grind the filesystem workload to a halt.. Hence
this deferral mechanism is actually critical to keeping the
filesystem caches balanced with the rest of the system.

The behaviour you see is the windup clamping code triggering:

/*
* We need to avoid excessive windup on filesystem shrinkers
* due to large numbers of GFP_NOFS allocations causing the
* shrinkers to return -1 all the time. This results in a large
* nr being built up so when a shrink that can do some work
* comes along it empties the entire cache due to nr >>>
* freeable. This is bad for sustaining a working set in
* memory.
*
* Hence only allow the shrinker to scan the entire cache when
* a large delta change is calculated directly.
*/
if (delta < freeable / 4)
total_scan = min(total_scan, freeable / 2);

It clamps the worst case freeing to half the cache, and that is
exactly what you are seeing. This, unfortunately, won't be enough to
fix the windup problem once it's spiralled out of control. It's
fairly rare for this to happen - it takes effort to find an adverse
workload that will cause windup like this.

So, with all that said, a year ago I actually fixed this problem
as part of some work I did to provide non-blocking inode reclaim
infrastructure in the shrinker for XFS inode reclaim.
See this patch:

https://lore.kernel.org/linux-xfs/20191031234618.15403-13-david@xxxxxxxxxxxxx/

It did two things. First it ensured all the deferred work was done
by kswapd so that some poor direct reclaim victim didn't hit a
massive reclaim latency spike because of windup. Secondly, it
clamped the maximum windup to the maximum single pass reclaim scan
limit, which is (freeable * 2) objects.

Finally it also changed the amount of deferred work a single kswapd
pass did to be directly proportional to the reclaim priority. Hence
as we get closer to OOM, kswapd tries much harder to get the
deferred work backlog down to zero. This means that a single, low
priority reclaim pass will never reclaim half the cache - only
sustained memory pressure and _reclaim priority windup_ will do
that.

You probably want to look at all the shrinker infrastructure patches
in that series as the deferred work tracking and accounting changes
span a few patches in the series:

https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@xxxxxxxxxxxxx/

Unfortunately, none of the MM developers showed any interest in
these patches, so when I found a different solution to the XFS
problem it got dropped on the ground.

> So why do we have to still keep it around?

Because we need a feedback mechanism to allow us to maintain control
of the size of filesystem caches that grow via GFP_NOFS allocations.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx